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Abstract 


Test  Range  Tracking  Network  Processors 

Creating  accurate  tracks  of  multiple  airborne  targets  from  multiple  sensors,  in  real 
time,  can  be  a  computationally  demanding  process.  Our  approach  is  to  perform 
hypothesis  testing  based  upon  the  traditional  method  of  Maximum  Likelihood,  but  within 
a  distributed  filtering  environment.  This  results  in  a  large  reduction  in  the  number  of 
floating  point  computations  required  to  generate  the  complete  set  of  likelihood  function 
values. 

This  final  report  describes  results  obtained  over  a  6  month  Phase  I  project.  The 
primary  mathematical  operation  performed  by  the  distributed  filter  is  matrix 
triangularization.  Thus,  this  research  focused  on  understanding  algorithms  for 
performing  this  operation,  as  well  as  their  parallelization. 

Three  methods  based  on  the  orthogonal  reduction  were  reviewed.  They  are  the 
Householder,  Givens,  and  Fast  Givens  methods.  Gaussian  elimination  seemed  to  be  an 
attractive  alternative  in  that  it  is  less  costly  than  those  based  on  orthogonal  reduction, 
but  this  method  is  not  numerically  stable  and  requires  pivoting. 

An  analysis  of  computational  cost  was  performed  for  Householder  and  Fast  Givens 
methods.  Although  the  Householder  method  is  superior  to  the  Fast  Givens  method  for  a 
generally  dense  matrix  factorization,  the  Fast  Givens  method  well  outperforms  the 
Householder  in  triangularizing  the  Local  Time  Update,  Local  Measurement  Update,  and 
Global  Measurement  Update  matrices  of  our  distributed  filter.  On  the  other  hand, 
triangularization  of  the  filter’s  Global  Time  Update  matrix  was  more  efficiently  done 
using  the  Householder  transformation.  This  is  due  to  the  sparse  data  structure  of  the 
matrix. 

The  concept  of  downdating  and  updating  was  reviewed  along  with  algorithms  from 
LINPACK.  While  the  updating  process  is  no  different  from  a  total  annihilation  process 
of  the  newly  added  observations  (appearing  as  rows  of  data),  downdating  is  a  backwards 
process  which  removes  the  contribution  made  by  the  eliminated  observations,  from  the 
transformed  (triangularized)  matrix.  The  cost  of  downdating  was  obtained  based  on  the 
subroutine  "SCHDD"  from  LINPACK. 

The  "Sameh  and  Kuck’s"  scheme  and  "Greedy"  scheme  were  reviewed  as  possibilities 
for  parallelization.  Both  schemes  were  modified  in  order  to  best  suit  the  block  upper- 
triangular  nature  of  the  system  matrices.  Finally,  a  hardware  processing  "cell"  which 
performs  a  plane  rotation  using  the  Fast  Givens  method,  was  designed.  A  linear  array  of 
cells  following  Sameh  and  Kuck’s  parallel  scheme  was  proposed. 
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Summaiy 


Creating  accurate  tracks  of  multiple  airborne  targets  from  multiple  sensors,  in  real 
time,  can  be  a  computationally  demanding  process.  Measurements  from  each  sensor 
must  first  be  correlated  with  each  track.  Then,  after  a  correct  association  is  made,  the 
track  can  be  updated  to  derive  a  new  estimate.  A  variety  of  algorithms  for  performing 
these  processes  of  "data  association"  and  "track  updating"  have  been  described  in  the 
literature.  Our  approach  is  to  perform  hypothesis  testing  based  upon  the  traditional 
method  of  Maximum  Likelihood,  but  within  a  distributed  filtering  environment.  This 
results  in  a  large  reduction  in  the  number  of  floating  point  computati».,.is  required  to 
generate  the  complete  set  of  likelihood  function  values. 

This  final  report  describes  results  obtained  over  a  6  month  Phase  I  project.  The 
primary  mathematical  operation  performed  by  the  distributed  filte*’  is  matrix 
triangularization.  Thus,  this  research  focused  on  understanding  algorithms  for 
performing  this  operation,  as  well  as  their  parallelization. 

Three  methods  based  on  the  orthogonal  reduction  were  reviewed.  They  are  the 
Householder,  Givens,  and  Fast  Givens  methods.  Gaussian  elimination  seemed  to  be  an 
attractive  alternative  in  that  it  is  less  costly  than  those  based  on  orthogonal  reduction, 
but  this  method  is  not  numerically  stable  and  requires  pivoting. 

An  analysis  of  computational  cost  was  performed  for  Householder  and  Fast  Givens 
methods.  Although  the  Householder  method  is  superior  to  the  Fast  Givens  method  for  a 
generally  dense  matrbc  factorization,  the  Fast  Givens  method  well  outperforms  the 
Householder  in  triangularizing  the  Local  Time  Update,  Local  Measurement  Update,  and 
Global  Measurement  Update  matrices  of  our  distributed  filter.  On  the  other  hand, 
triangularization  of  the  filter’s  Global  Time  Update  matrix  was  more  efficiently  done 
using  the  Householder  transformation.  This  is  due  to  the  sparse  data  structure  of  the 
matrix. 

The  concept  of  downdating  and  updating  was  reviewed  along  with  algorithms  from 
LINPACK.  While  the  updating  process  is  no  different  from  a  total  annihilation  process 
of  the  newly  added  observations  (appearing  as  rows  of  data),  downdating  is  a  backwards 
process  which  removes  the  contribution  made  by  the  eliminated  observations,  from  the 
transformed  (triangularized)  matrix.  The  cost  of  downdating  was  obtained  based  on  the 
subroutine  "SCHDD"  from  LINPACK. 

The  "Sameh  and  Kuck’s"  scheme  and  "Greedy"  scheme  were  reviewed  as  possibilities 
for  parallelization.  Both  schemes  were  modified  in  order  to  best  suit  the  block  upper- 
triangular  nature  of  the  system  matrices.  Finally,  a  hardware  processing  "cell"  which 
performs  a  plane  rotation  using  the  Fast  Givens  method,  was  designed.  A  linear  array  of 
cells  following  Sameh  and  Kuck’s  parallel  scheme  was  proposed. 
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1.  Introduction 


Creating  tracks  or  estimates  of  the  position  and  velocity  (and  other  dynamical  states) 
of  multiple  targets  from  a  network  of  multiple  sensors,  in  real  time,  can  be  a 
computationally  demanding  process.  Measurements  from  each  sensor  must  first  be 
correlated  with  each  track.  Then,  after  a  correct  association  is  made,  the  track  can  be 
updated  to  derive  a  new  estimate.  A  variety  of  algorithms  for  performing  these 
processes  of  "data  association"  and  "track  updating"  have  been  described  in  the  literature. 
This  is  an  active  area  of  research,  whose  goal  is  to  produce  tracks  with  ever  increasing 
accuracy.  Military  applications  of  the  research  are  improvements  in  weapon  system 
performance  as  well  as  their  test  and  evaluation. 

The  amount  of  computation  performed  in  this  two  step  process  increases  nonlinearly 
with  the  number  of  targets  and  the  number  of  sensors.  Tbe  nonlinearity  is  due  to  the 
combinatorial  nature  of  the  "data  association"  process,  which  is  portrayed  in  Figure  l.-l. 
In  this  worst  case,  when  there  are  t  targets  and  each  one  of  M  sensors  sees  all  of  the 
targets,  the  total  number  of  association  hypotheses  at  each  time  step  could  be  as  high  as 
t'^^\  However,  if  one  proceeds  track  by  track,  eliminating  correctly  associated 
measurements  from  further  consideration,  then  the  number  of  hypotheses  per  track 
would  continuously  decrease.  In  this  case,  the  total  number  of  association  hypothesis  at 
each  time  step  would  be  reduced  to  t^  +  (t*!)'^  +  ...  + 

As  an  example,  when  t  =  10  and  M=3,  which  corresponds  to  a  typical  range  scenario 
for  testing  an  MLRS  rocket  with  6  submunitions  aboard,  there  are  10,000  possible 
associations  but  only  3025  when  correct  associations  are  sequentially  eliminated. 

Although  this  represents  a  decrease  in  the  number  of  hypotheses  by  69.8%,  still  the 
problem  is  computationally  intensive.  At  data  rates  of  120  samples  per  second,  1 
hypothesis  must  be  generated  and  tested  every  2.75  microseconds! 

After  all  possible  associations  are  computed  (and  filtered)  locally  at  each  iteration,  the 
resultant  set  of  smoothing  coefficients  must  be  sent  to  the  global  processor  and  merged 
to  obtain  a  set  of  optimal  solutions,  each  one  corresponding  to  a  different  association. 
The  globally  optimal  one  stems  from  the  association  which  admits  the  maximum  value 
for  the  likelihood  function.  The  set  of  optimal  solutions  may  be  generated  by  first 
deleting  an  almost  upper  triangular  block  of  a  large  matrbc  (large  columns),  and  adding  a 
new  one  corresponding  to  a  different  association.  Then,  Householder  transformations  or 
Givens  rotations  may  be  applied  to  the  matrix,  putting  it  into  upper  triangular  form.  The 
latter  two  steps  are  repeated  many  times  until  all  of  the  hypotheses  have  been  tested. 
Thus,  the  focus  of  this  research  is  the  development  of  specific  algorithms  for  performing 
the  transformations  and/or  rotations  as  fast  as  possible. 

One  particularly  interesting  idea  is  to  first  "downdate"  the  old  association  under  test 
directly  from  the  upper  triangular  matrix,  and  then  "update"  it  with  the  new  association. 
Thus,  working  on  a  relatively  full  and  large  matrix  can  be  avoided  and  much 
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computation  can  be  saved. 


sensor  #1 


#2 


#M-1 


sensor  #M 


:  1  association  hypothesis  for  target  #2 


Figure  l.-l:  Pictoral  of  a  Single  Hypothesis  for  Data  Association 
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2.  Work  Carried  Out/Results  Obtained 


2.1  The  Decentralized  Square  Root  Information  Filter 

The  Decentralized  Square  Root  Information  Filter  (DSRIF)  was  introduced  in  [1,2]. 

It  is  a  distributed  solution  to  the  constrained  linear  least  squares  estimation  problem,  and 
can  be  used  to  update  and  extrapolate  tracks.  For  1  target,  the  DSRIF  admits  the 
following  matrix  structure  shown  in  Figure  2.1-1.  As  seen,  it  includes  (i)local  time 
update,  (ii)local  measurement  update,  (iii)global  time  update  and  (iv)global 
measurement  update  matrices  which  are  block  upper-triangular!  In  Figure  2.1-1, 

q  =  dimension  of  the  process  noise  vector  (typically  3) 
n  =  dimension  of  the  state  vector  (typically  6  or  9) 

m  =  dimension  of  the  measurement  vector  (assumed  to  be  the  same  for  all  sensors, 
actually  m  equals  2  or  3) 

M  =  number  of  sensors,  and  we  assume  that  each  sensor  sees  all  of  the  targets 

For  multiple  targets,  we  have  the  following  structures  where  t  is  equal  to  the  total 
number  of  targets  being  tracked  by  the  network 

(i)  t  local  time  updates  which  can  all  be  done  in  parallel 

(ii)  t^  completely  independent  local  measurement  updates  which  can  all  be  done  in 
parallel  i.e.,  there  is  no  recursion 

(iii)  t  global  time  updates  which  can  all  be  done  in  parallel 

(iv)  t*^*^  global  measurement  updates,  each  corresponding  to  a  different  set  of 
associations  of  measurements  with  target  tracks,  is  equal  to  the  maximum  number 
of  hypotheses. 
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(i)  local  time  update 


(ii) 'local  measurement  update 


Figure  2.1-1:  Matrix  Topology  for  the  DSRIF 


22  Survey  of  Microprocessors 

Microprocessors  and  math  coprocessors  that  are  commercially  available  in 
multiprocessing  board  sets  are  listed  in  Tables  2.2-1  and  2.2-2. 


Table  2.2-1:  Math  Coprocessors 


Manufacturer 

Processor 

Clock  Speed 
(MHz) 

MIPS 

Benchmark 

Speed 

(MFLOPS) 

- 1 

Price  $  U 

Cyrix 

83S87 

20  MHz 

556 

83D87 

33  MHz 

994 

EMC87 

33  MHz 

994 

Intel 

8087 

5  MHz 

142 

80287XL 

12.5  MHz 

326 

387SX 

20  MHz 

550 

387DX 

33  MHz 

994 

i486 

33  MHz 

667 

Motorola 

68881 

20  MHz 

68 

68882 

40  MHz 

218 

Weitek 

3167 

33  MHz 

995 

4167 

33  MHz 

1,295 

Our  survey  revealed  that  currently,  several  hundred  Mflop  systems  are  available  for 
under  10  thousand  dollars.  However,  a  300  Mflop  multiboard  set  can  perform  only  825 
floating  point  operations  in  2.75  microseconds  (continuing  the  example  from  section  1.). 
This  is  approximately  1  tenth  of  the  total  number  of  floating  point  computations  needed 
to  time  update  and  measurement  update  the  DSRIF.  Thus,  there  is  motivation  for 
continuing  this  research  in  the  pages  that  follow. 
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Table  2.2-2:  General  Purpose  Microprocessors 


Manufacturer 

Processor 

Clock 

Speed 

(MHz) 

MIPS 

Benchmark 

Speed 

(MFLOPS) 

Price  $ 

Advance  Micro 
Devices  Inc. 

AM29050 

40  MHz 

32 

80 

410 

Cypress/Ross 

CY7C601 

40  MHz 

29 

Technology  Inc. 

Fujitsu  Micro- 
Electronics 

MB86903 

40  MHz 

29 

5 

350 

Integrated 

Device 

Technology  Inc. 

79R3000A 

40  MHz 

33 

11 

275 

Intel  Corp. 

i860 

40  MHz 

57.5 

10 

567 

Motorola  Inc. 

88100 

33  MHz 

28 

5.4 

150 

Performance 

Semiconductor 

PIMM 

40  MHz 

33 

1300 

VLSI 

Technology  Inc. 

VL86C020 

25  MHz 

100 
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23  Matrix  Upper  Triangularization 

Householder,  Givens  and  Fast  Givens  transformation  techniques  were  investigated 
during  the  Phase  I  work.  Analytical  expressions  for  their  computational  cost  in  terms  of 
dimensional  parameters  were  derived.  These  expressions  are  useful  in  deciding  which 
technique  is  most  efficient. 


23.1  Description  of  the  Householder  Method 
Following  [3],  a  general  matrix,  A  c  R"“",  can  be  upper  triangularized. 

A  =  QR 

where  Q  is  an  orthogonal  matrix,  R  is  an  upper  triangular  matrix  with 

R  =  P^Pn-PiPA 

where  k-n  for  m>n  or  k  =  m-l  for  m<n,  and  Pj  c  jg  ^  Householder  transformation 
matrix. 


0  =  (P.P..,...P2Pl)’. 

P,  =  I  -  WjWj‘^ 


where  w,  is  a  real  column  vector  and 


w,‘Vi  =  2. 

As  a  first  step  to  triangularization,  we  annihilate  all  elements  below  the  main  diagonal 
in  the  first  column  of  A.  This  is  shown  as  follows: 

Ut 

A,  =  P,A 


=  (I  -  WjW,*'')A 
=  A  -  Wjw/'A 
=  A  -  w,w,'^[a,,  a,, ... ,  a„] 
where  a^  is  the  j"*  column  of  A. 
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Let 


“  [^n'Si.  ®21»  ^31»  —  >  3ml] 

W,  =  MiUi, 

where 

Sj  =  ±(ai‘'a,)'^  =  ± (Euclidian  Norm  of  al), 

Ti  =  (s,2  -  ai,s,)‘\ 

Ml  = 

and  the  sign  of  s,  is  chosen  to  be  opposite  to  that  of  ajj  for  numerical  stability. 

w,‘'a,  =  M,[(a„  *  s,)ai,  +  a^J 

~  -  aijSj) 

=  M,/ti 
=  VMi 

a,i  -  Wiw/'a,  =  a„  -  Mi(aii  -  Si)/m,  =  s, 

a,,  -  w,w/"a,  =  a„  -  =  0  for  r  =  2 . m 

where  w,  is  the  rth  element  of  w,. 

Therefore, 

Si  a’, 2  a’ 

0  a  27 

Al  =  PiA  = 

0  a’„2  a’ 

To  aimihilate  all  elements  below  the  main  diagonal  in  the  second  column  of  A„  the 
previous  procedure  is  repeated  with  W2  =  M2«2 
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where, 


'•2*'^  -  [0,  ^22  ■  ^32»  ^42’  •”  >  ^m2]> 

and  $2  is  the  Euclidian  norm  of  the  vector  which  results  from  the  second  column  of  Al, 
but  exclusive  of  its  elements  above  the  main  diagonal.  The  sign  of  $2  is  taken  as 
opposite  to  that  of  a’22- 

Continuing,  we  have 


A2  =  P2A1  =  (I  -  W2W2‘0Ai. 

A2  retains  the  zero  elements  in  the  first  column  and  P2  introduces  new  zeros  in  the  last 
m-2  positions  of  the  second  column. 

To  annihilate  all  elements  below  the  main  diagonal  in  the  third  column  of  A2,  the 
previous  procedure  is  again  repeated,  but  with 

U3"^  =  [0,  0,  a  33  -  S3,  a  43, ... ,  a  ^3] 

and  S3  is  the  Euclidian  norm  of  the  vector  which  results  from  the  third  column  of  A2,  but 
exclusive  of  its  elements  above  the  main  diagonal.  The  sign  of  S3  is  taken  as  opposite  to 
that  of  a”33.  The  a”33  is  the  element  at  the  third  row  and  third  column  position  in  the 
matrix  Aj. 

We  continue  in  this  way  to  zero  elements  below  the  main  diagonal  in  column  by 
column  order  by  applying  Householder  matrices.  Thus  the  resultant  matrix  R  is  an 
upper  triangular  matrix, 


R  =  P,  P,A 

Now,  go  back  to  the  stage  of  obtaining  A,. 

A]  =  P,A  =  (I  -  h'jWi‘'^)A 
=  A  -  w,w,‘'A 

=  A  -  w,(w,''a„  w,‘'a2 . w,'X) 


So  the  j'”  column  of  A,  is 
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The  reduction  of  an  m  x  n  matrix  for  m  >  n,  to  upper  triangular  form  using 
Householder  transformations  is  summarized  by  the  following  vector  pseudocode,  in 


Figure  2.3. 1-1:  Column-oriented  Householder  reduction 


Although  column  oriented  Householder  reductions  was  discussed,  the  method  can  also 
be  applied  in  a  row  oriented  format.  This  alternative  has  no  advantage  over  the  column 
oriented  format,  and  therefore  was  not  further  investigated. 


232  Computational  Cost  for  Householder  Reduction 

A  more  formalized  Householder  algorithm  from  [4]  is  shown  below  in  Figure  2.3.2-1. 
The  algorithm  factors  an  m  x  n  matrix  A,  overwriting  the  coefficient  matrix  with  both  R 
and  the  vectors  characterizing  each  Householder  matrix.  During  the  kth  step,  we 
annihilate  m  -  k  elements  in  the  coefficient  matrix  using  a  Householder  matrix  whose 
corresponding  vector  w  has  m  -  k  +  1  non-zero  components.  Since  the  non-zero 
components  of  w  do  not  fit  within  the  space  created  by  the  annihilated  coefficients,  we 
store  the  diagonal  of  R  in  a  separate  array  d  to  make  room  for  the  first  non-zero 
component  of  w.  After  executing  the  following  algorithm,  the  diagonal  of  R  is  stored  in 
d,  the  remaining  elements  above  R’s  diagonal  are  stored  above  the  diagonal  in  A,  and 
the  vectors  w  related  to  each  Householder  matrix  are  stored  on  and  beneath  the 
diagonal  of  A. 
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The  above  Householder  algorithm  annihilates  elements  column  by  column  in  order  as 
shown  below  for  a  (6  x  5)  matrbc  as  an  example. 


X 

X 

X 

X 

X 

1st  row 

1 

X 

X 

X 

X 

2nd  row 

1 

2 

X 

X 

X 

3rd  row 

1 

2 

3 

X 

X 

4th  row 

1 

2 

3 

4 

X 

5th  row 

1 

2 

3 

4 

5 

6th  row 

The  integer  number  indicates  the  steps  of  annihilation.  Here  annihilation  means  the 
value  of  an  element  becoming  zero. 

Costs  of  the  algorithm  in  Figure  2.3.2- 1  is  shown  in  Tables  2.3.2- 1  to  2.3.2-6  for  m<n 
and  Tables  2.3.2- 1  to  2.3.2-9  for  m>n.  We  assume  that  the  overhead  costs  due  to 
communication  between  registers,  initialization  of  memory,  and  transfer  is  negligible 
compared  with  the  cost  of  arithmetic  operations. 
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Given:  A  =  [m  x  n]  dense  matrix,  m  <  n 


Table  2.3.2-1.  Annihilation  of  1st  column 


1 

All  (he  elements  below  the  main  diagonal  in  the  first  column  become  zero. 

1  -  1.  k  =  1 

s  =  CS,  =  (aii^  +  Sji*  + 

l•K 

+  a 

ra  X,  (m-l)  +.  1  ^ 

t  -  a„ 

r-  l/Is(s+|t|)l'/' 

lx,  1  + ,  1  /  1  +,  1  sign  (abs) 

1st  column: 

a„  »  rlt  ♦  (sgn(t))sl 
*21  “  ^*21 
*31  “  ‘■*31 

;  -sgn*  is 

sign  function  —  returns  the  sign  of  argument. 

(m+1)  X,  1  +,  1  sign 

2nd  column:  j  =  l  +  l=2 

t  »  a,,a,j  +  321322  +  ^31*32  - 

®I2  “  *12  ■  •■*11 
*22  “  *22  ■  ‘*21 
*32  *  *32  •  '*31 

■  +  *n,l*m2 

m  X,  (m-l)  + 

*in2  ”  *m2  •  ‘*1111 

m  X.  m  > 

3rd  column:  j*l  + 1  =  3 

t  «  ai,3,j  +  32,323  +  aj,ai3  +  ... 
*13  '  *13  •  **11 
*23  =  *23*  '*21 
*33  •  *33  ‘  '*31 

■•■  *bi1*bO 

m  X,  (m-l)  + 

*in3  *  *ni3  '  '  *ml 

m  X,  m  * 

4th  column:  j  =  l  +  l=4 

•  «  3,,a,2  +  321324  +  33,334  +  .... 

*14  -  8,4-13,1 

*34  ”  *24  •  '*21 
•34  '  *34  •  '*31 

m  X,  (m-l)  + 

■m4  “  *in4  ■  ‘*1111 

m  X,  m  - 

nth  column:  j*l  +  l  =  n 

'  •  'll*!®  ••■  *21*2n  ♦  *3l*3n  +  - 
•la  •  »ln-‘«ll 
■2n  “  *2b-'*21 
*3b  ”  •3b  •  '*31 

m  X,  (m-l)  + 

•bib  *  *mn  *  '*Bil 

m  x,  m  - 
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Table  2.3.2-2.  Annihilation  of  2nd  column 


All  the  elements  below  the  main  diagonal  in  the  second  column  become  zero. 

1  =  2,  k  =  2 

s  “  =  (*22^  +  *32^  + 

l-ll 

+  a  ^3'/^ 

-  *  *m2  J 

(m-l)  X,  (m-2)  +.  W 

t  *  322 

r  =  l/ls(s+|t|)l'/2 

lx,  1  + ,  1  y,  1  1  sign  (abs) 

2nd  column: 

*22  =  r|l  +  (sgn(t))s] 

®32  ”  ‘■'*32 
*42  =  '•*42 

and  =  r*B>2 

m  X,  1  4- ,  1  sign 

3rd  column:  j  =  l+l=3 

t  =  aj^ay  +  ajjajj  +  +  ... 

*23  “  *23  ‘  ‘*22 
*33  =  *33  •  '  *32 
*43  =  *43  •  '  *42 

*  ^m2^in3 

(m-l)  X,  (m-2)  + 

®m3  *  ®m3  ' 

(m-l)  X,  (m-l)  - 

4ih  column:  j  =  l+l=4 

t  ■  8228^4  +  a3jaj4  +  843844  +  ... 

*24  *  *24  ■  '  *22 
*34  *  *34  •  '*32 
*44  “  *44  *  '  *42 

+  and*n>4 

(m-l)  X,  (m-2)  + 

®in4  “  ®n>i  "  * 

(m-l)  X,  (m-l)- 

nth  column:  j  =  l  +  l  =  n 

t  -  ajjaj„  +  a3ja3„  +  a43a4„  +  .. 
*2n  “  *2n  '  '  *22 
*3n  *  *3n  ■  '  *32 
*4ll  =  *4n  '  '  *42 

(m-l)  X,  (m-2)  + 

*mn  *  ®mn  '  * 

(m-l)  X,  (m-l)  - 
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Table  2.3.2-3.  Annihilation  of  3rd  column _ 

All  the  elements  below  the  main  diagonal  in  the  third  column  become  zero. 

I  =  3,  k  =  3 

s  =  (S  =  (ajj^  +  a^j^  +  ...  +  (m-2)  x,  (m-3)  +.  1  J 

i>k 

I  =  833 

r  =  l/(s(s+|i|)l‘/^  lx,  1  +,  W.  1  1 

3rd  column:  j  =  l  +  l=3 

833  =  fl*  +  (sgn('))sl 

®43  “  *’■*43 
*53  “  '■^53 


am3  =  ^«m3 
4th  column:  j  =  l  +  l=4 

t  =  ajja^^  +  aj3a44  +  aj3aj4  +  ....  +  a^,a^ 
*34  “  *34  •  '  *33 
*44  “  *44  •  >  *43 

*54  =  *54  -  1 353 


*ii>4  “  *m4  *  I*m3 

5th  column:  j  =  l  +  l=5 

I  =  a33a35  +  343345  +  353855  +  ....  +  a(n3*m5 
*35  “  *35  •  '*33 
*45  ’  *45  ■  •  *43 
*55  '  *55  *  *  *53 


(m-1)  X,  1  +,  1  sign 
(m-2)  X,  (m-3)  + 


(m-2)  X,  (m-2)  - 
(m-2)  X,  (m-3)  + 


(m-2)  X,  (m-2)  - 


nth  column:  j  =  l+l  =  n 

>  =  *33*3n  +  a43*4„  +  *53*5n  +  ■  +  *m.1*mn 

*3n  °  *3n  ‘  '  *33 
*4ii  ■  *4n  ■  '*43 
*5n  '  *Sn  •  '*53 


(m-2)  X,  (m-3)  + 


*inn  “  amn  "  '  *in3 


(m-2)  X,  (m-2)  - 


Successive  columns  are  processed  in  a  similar  manner. 
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Table  2.3.2-4.  Annihilation  of  m-3  column 


All  (he  elements  below  the  main  diagonal  in  the  m-3  column  become  zero. 


I  «  m-3,  k  »  m-3 

*  *  *in-3jn-3 

r  -  l/(s(s+ 1(1)1*''^ 

m-3  column:  j*l  +  l  =  m-3 
«in-3jn-3  *  ^1*  +  (Sgn(l))s) 

*in'2jn-3  *  ^'®in*2jn-3 
*in-ljD-3  “  *''®m-ljn.3 
*oijn*3  * 

m-2  column:  j-l+l  =  m-2 

*  *  *m-3,in-3®m-3.m-2  ^  ®m'2.m-3®m-2.m-2 

®in*ljn-3®m-l.m-2  ®fn.m'3®fn.m'2 


®m-3.m-2 

®m'3,m-2 

*'®m-3.m-3 

®in-2.m-2 

s 

®m-2.m-2 

*  ^m*2.ni-3 

*m-l4n-2 

a 

®m-l.m-2 

^  ^m-l.m-3 

®m.m-2 

= 

®m.m-2 

*  ^m.m*3 

m-1  column:  j*l+l*m*l 

*  ®  ®m'3,m'3®m-3.m-l  **'  ®m-2.fn-3®rn-2.m-l 
®m-l.m-3®m-l,m-l  ®m.m-3®mjn-l 
®m*3.m‘l  *  ®m-3,m-l  *  ^‘®m-3.m‘3 

®m-2.m-I  “  ®m*2.ni-l  " 

®m'l.m‘l  ®  ®m-l,m*l  *  *'®m-l.m'3 

®m,m*l  *  *  *'®m.m*3 


+  Vm-3V'*  4X.  3+.  W 

lx,  1  -f,  1  /  1*,  1  sign  (abs) 

5  X,  1  +,  1  sign 

4x.  3  + 

4  X,  4  - 

4x,  3  + 

4  X,  4  - 


nih  column:  j  =  l+  1  =  n 

*  *  ®m*3.m*3®m'3.n  ®m*2.m*3®m  2.n 

4x,  3  + 

®m-3.n  ”  ®m-3.n  "  *'®ni-3.m-3 

*m-Zn  ”  ®m-2.n  “  *'®m-2,m-3 

®m-l.n  ~  ®m-lm  *  *^m-l.m-3 

®ai.n  ”  ®ni.n  '  *"®m,m-3 

4  X,  4  - 
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Table  2.3.2-5.  Annihilation  of  m-2  column 


All  the  elements  below  (he  main  diagonal  in  the  m-2  column  become  zero. 


I  «  m-2,  k  ■  m-2 

S  -  (2  a.»)'/2  =  *  *m.m 

1-fc 

*  “  *m*2.in-2 

r-  l/Is(s+|t|)l‘/^ 

m-2  column;  j«l+l«m-2 
•m-Zon-z  “  rl*  ♦  (sgn(t))sl 

^iwlA-Z  *  ^^m-Jjn-Z 
*ai,m-Z  *  ^^oun-Z 


m-1  column: 

j  =  l  +  l«ni 

1-1 

+  an,.i.n,za, 

*in*2jn-l 

*  ®m'2.m-l 

*  ®m-2.m-2 

*in*l  jn-1 

*  ^m-l.m-2 

*  ®in.in*l 

*  ®m.m-2 

nth  column: 

j»I+  1  *n 

*  “  *iB-Z.ni.Z*m.Z.n  * 
*m.m*Z®m.n 
“  *iB.Z.n  ■  *'®in-Z,ni  Z 
■  *  '®n).| jn  Z 

*m.n  “  *in.n  "  **m.m.Z 


y/*  3x,  2*.  1  y 

lx,  1  ,  1  y,  i  1  sign  (abs) 

4  X,  1  -t^,  1  sign 

3x.  2  + 

3x.  3- 

3x.  2  + 

3  X,  3  - 


Table  232-6.  Annihilation  of  m-l  column 


All  the  elements  below  the  main  diagonal  in  the  m-l  column  become  zero. 

1  “  m-l.  Ic  «  m-l 

s  -  (1  ^  a_,V'^ 

2x.  1  ♦.  W 

r  =  l/ls(s+|l|)l‘''^ 

lx.  1  -f .  \  J,  1  -r,  1  sign  (abs) 

m-l  column;  j«l+l  =  m-l 

atn-ljn-l  '  ^l*  +  (Sgn(t))s] 

3  X.  1  -f ,  1  sign 

nth  column:  j  =  l  +  l  =  n 

2x.  1  + 

a  s  a  -  i  a 

‘*fn*l,n  m-l. m-l 

®m.n  ~  ®m.n  ~  *  ®m.m-l 

2x.  2- 

The  upper  triangularization  of  an  m  x  n  matrix  for  m<n  is  complete  at  this  step! 
However,  the  upper  triangularization  is  not  done  if  m  is  greater  than  n.  Therefore,  a  few 
more  steps  of  annihilation  are  presented  for  a  matrix  in  which  m  is  greater  than  n. 


Table  23.2-1.  Annihilation  of  n-2  column 


AJI  the  elements  below  the  main  diagonal  in  the  n-2  column  become  zero. 

I  «  n-2.  k  «  n-2 

i>k 

'  •  «D-in  2 

r  -  l/ls(s+l'|)l''^ 
n-2  column; 

»«-2ji-2  •  rl‘  +  (s«n(0)sl 

•n-lji-2  •  •"•ii-lji-2 

*0.0-2  *  f*oj>-2 

•m,n-2  •  r«in.n-2  (m-n  +  4)  X.  1  +  .  1  f ign 

n-1  column:  j*l  +  l  =  n-l 

*  “  *n-2.n.2*n-2.n.l  *  **n.|,n.2*n.l.ii-l 

+  +  am.n-2an,.n.i  (m-n  +  3)x,  (m-n  +  2)  + 

*n-in  l  “  *li-2,n.|  '  *'*n-2.n-2 

*0-1.11*1  *  au.i.n-i  *  ta^j.i  ^.j 

*«J,-1  =  Vn  i  -'*o,.n.2  (m-n  +  3)x.  (m-n  +  3)- 

nth  column;  ja|  + 1  =  n 

*  *  *ii-2j»*2®n-2.n  ®n-l.n'2^n-!.n 

+  •••  +  *m.n-2*m.n  (m-n  +  3)  X.  (m-n +  2)  + 

*0-2, n  “  *0-2,0  *  *’*0-2.0, 2 

*0-1,0  “  *0-1.0  ■  **0*1, n-2 

am,o  =*01,0  • '*01.0,2  (m-n  +  3)x.  (m-n  +  3) - 


(m-n  +  3)  X,  (m-n +  2)  +.  1  7 


lx,  1  + ,  1  /  1  +,  1  sign  (abs) 
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Table  2.3.2-8.  Annihilation  of  n-1  column 


The  upper  triangularization  of  an  m  x  n  matrix  for  m>n  is  complete  at  this  step! 
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Now  we  collect  all  the  cost  terms  along  with  the  arithmetic  operations  from  the 
foregoing  procedures  (Tables  2.3.2-1  to  2.3.2-6)  and  summarize  them  in  Table  2.3.2-10 
for  the  case  m  <  n. 

Combining  the  costs  of  arithmetic  operations,  we  have  the  total  cost  (Table  2.3.2-11) 
for  upper  triangularization  of  an  m  x  n  matrix  (m<n).  Here,  we  assume  that  the  unit 
cost  for  subtraction  is  the  same  as  that  for  addition,  and  the  unit  cost  for  division  is  the 
same  as  that  for  multiplication. 

Again,  collecting  all  the  cost  terms  from  Tables  2.3.2-1  to  2.3.2-9  and  summarizing 
them.  Table  2.3.2-12  is  a  summary  for  the  case  m>n. 

Combining  the  costs  of  arithmetic  operations,  we  have  the  total  cost  (Table  2.3.2-13) 
for  upper  triangularization  of  an  m  x  n  matrix  (m>n). 


Table  2.3.2-10.  Summary  of  Cost  for  Upper  Triarrgularization  of  an  m  x  n  dense  matrix,  m<n. 


n 

CVJ 


Table  2.3.2-12.  Summary  of  Cost  for  Upper  Triangularization  of  an  m  x  n  dense  matrix,  m>n. 


As  we  observed,  the  cost  of  an  upper  triangularization  by  Householder  reduction 
depends  only  on  the  dimension  of  the  matrix;  the  structure  of  data  within  the  matrix  has 
no  effect  on  the  cost.  As  we  will  see  in  a  later  section,  the  structure  of  data  greatly 
affects  the  cost  of  Fast  Givens  or  Givens  reduction. 

The  costs  for  triangularizing  the  Local  Time  Update,  Local  Measurement  Update, 
Global  Time  Update,  and  Global  Measurement  Update  system  matrices  are  easily 
obtained  by  substituting  the  representative  dimensional  parameters  into  the  cost 
equations  of  Tables  2.3.2-11  and  2.3.2-13.  In  substituting  these  parameters,  it  is 
important  to  note  that  the  Local  Time  Update  system  matrix  is  of  type  m<n,  while  the 
other  three  system  matrices  are  of  type  m>n. 


2.3.2. 1  Cost  for  a  Local  Time  Update 

This  system  matrix  is  m  x  n  with  m<n.  Here,  m  represents  (q+n)  and  n  represents 
(q  +  n+ 1).  Therefore,  Table  2.3.2-11  is  used  to  calculate  the  cost  of  triangularizing  this 
system,  and  Table  2.3.2. 1-1  shows  the  total  cost. 


Table  2.3.2.1-1.  Total  Cost  for  Upper  Triangularization  of  an  LTU. 


Matrix  dimensions: 

l(q  +  n)x(q  +  n  +  l)l 

Operation 

Cost 

X&  + 

(l/3)12(q  +  n)*  +  6(q  +  n)^  ♦  13(q+n)-211 

♦  &  ■ 

(l/3)l2(q  +  n)’  +  3{q  +  n)^  +  4(q  +  n)-9] 

V 

2(q  +  n)-  1 

sign 

2(q  +  n)-l 

2.3.2.2  Cost  for  a  Local  Measurement  Update 

This  system  matrbc  is  m  x  n  with  m>n.  Here,  m  represents  (n  +  m)  and  n  represents 
(n+ 1).  Therefore,  Table  2.3.2-13  is  used  to  calculate  the  cost  of  triangularizing  this 
system,  and  Table  2.3.2. 1-1  shows  the  total  cost. 


Table  2.3.2.2-1.  Total  Cost  for  Upper  Triangularization  of  an  LMU. 


Matrix  dimensions: 

((n  +  m)  X  (n  +  l)j 

Operation 

Cost 

X  & 

|n+  l|{2(n  +  m)  -  (n+  1)  ♦ 

»  ♦  (l/3)n|3(n  +  m)-n  +  l)) 

♦  &  - 

(n  +  l)(n  +  2m  +  2)/2  +  n(n^ 

H)|m  +  (l/6)(4n-l)) 

J 

2(n  +  l) 

sign 

2(n  +  I) 

2.3J.3  Cost  for  a  Global  Time  Update 

This  system  matrix  is  m  x  n  with  m>n.  Here,  m  represents  (Mq+n)  and  n  represents 
(q+n+ 1)  in  case  A,  while  m  represents  [(M+l)q+n]  in  case  B.  Therefore,  Table  2.3.2- 
13  is  used  to  calculate  the  costs  of  triangularization  in  both  cases  A  and  B,  and  Tables 
2.3.2.3-1  and  2.3.2.3-2  show  the  total  costs  for  case  A  and  case  B  respectively. 


Table  2.3.2.3-1.  Total  Cost  for  Upper  Triangularization  of  a  GTU,  Case  A. 


Table  2.3.2.3-2.  Total  Cost  for  Upper  Triangularization  of  a  GTU,  Case  B. 


Matrix  dimensions: 

{l(M  +  l)q  +  nlx(q  +  n  +  l)} 

Operation 

Cost 

X  &  + 

[(M  +  l)q  +  nl  (q  +  n  +  l)(q  +  n  +  2)-(l/3Kq  +  n  +  l)  [(q  +  n  +  l)^  -  10) 

♦  &  ■ 

(l/3)(q  +  n  +  l){3[(M  +  l)q  +  nl(q  +  n  +  l)-(q  +  n  +  l)^  +  4} 

J 

2(q  +  n+l) 

sign 

2(q  +  n  +  l) 

2.3J.4  Cost  for  a  Global  Measurement  Update 

This  system  matrix  is  m  x  n  with  m>n.  Here,  m  represents  for  (M+  l)n  and  n 
represents  (n+ 1).  Therefore,  Table  2.3.2-13  is  used  to  calculate  the  costs  of 
triangularization,  and  Table  2.3.2.4-1  shows  the  total  cost. 

Table  2.3.2.4-1.  Total  Cost  for  Upper  Triangularization  of  a  GMU. 

Matrix  dimensions:  [(M  + l)n  x  (n  +  1)) 

Operation  Cost 

X  &  (M  +  l)n(n  +  n(n  +  2)  -  (l/3)(n  +  l)l(n  +  l)M01 

+  &-  (l/3)(n  +  l)13(M  +  l)n(n  +  lHn  +  l)'+^J 

J  2{n  +  l) 

sign  _ 2(n-»  1) _ _ _ 
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2.33  Description  of  the  Givens  Method 

Householder  transformations  are  very  useful  for  introducing  zeros  on  a  grand  scale,  i.e., 
the  annihilation  of  all  but  the  first  component  of  a  vector.  However,  in  many 
computations  it  is  necessary  to  zero  elements  more  selectively.  Givens  transformations 
are  the  tools  for  this  application. 

The  OR  factorization  of  a  matrix,  A  e  JR"**",  can  also  be  computed  using  the  Givens 
method  [3].  The  Givens  transformation  matrix,  also  called  the  plane  rotation  matrix  G^ 
€  IR™*'",  has  the  following  form; 


G 


ij 


1 

1 

COS0.J  sin0y 
-sin  0,^  cos  0,| 


1 


with  sine  and  cosine  terms  in  the  ith  and  jth  rows  and  columns  as  shown  above. 
A  given  m  x  n  matrix.  A,  can  be  upper  triangularized  in  the  following  manner. 


A,  =  GjiA  - 


^12^1  ^12^2 

*^12^1  ^12^2 

a.-, 


where  Ci2  =  cos0,2,  Si2  =  sin0,2,  and  a,  =  ith  row  of  A. 

Choose  0,2  such  that  -s,2an  +  c,2a2,  =  0.  This  means  that  the  first  element  in  the 
second  row  of  A,  becomes  zero.  We  don’t  actually  calculate  0,2  but  only  its  sine  and 
cosine  values.  The  equation  implies  that  tan0,2  =  a2,/a,,.  Therefore, 

S12  =  22, /(a-,,  +  a-2,f, 

c,2  =  all/(a^I  +  2^2)^ 
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Thus,  A,  has  a  zero  in  the  (2,1)  position  and  the  elements  in  the  first  two  rows  now 
differ,  in  general,  from  those  of  the  given  matrix  A,  while  the  remaining  rows  are  the 
same. 

To  make  the  (3,1)  position  zero,  we  calculate  GjjA,,  which  produces  A2  and  modifies 
the  first  and  third  rows  of  Aj  while  leaving  all  others  the  same;  the  zero  produced  in  the 

(2.1)  position  in  the  first  stage  remains  zero.  The  angle  <^13  is  now  chosen  such  that  the 

(3.1)  position  of  Aj  would  be  zero.  Continuing  in  this  fashion,  zeroing  the  remaining 
elements  in  the  first  column,  one  after  another,  and  then  zeroing  elements  in  the  second 
column  in  the  order  (3,2),  (4,2),  ...  ,  (m,2),  and  (4,3),  (5,3),  ...  ,  (m,3)  for  the  third  column 
and  so  on.  In  all,  assuming  m  >  n,  we  will  use  (m-1)  +  (m-2)  +  ...  +  (m-n)  plane 
rotation  matrices  and  the  result  is  that 

GA  =  G„,n  Gi3Gi,A  =  R 

is  upper  triangular.  The  matrices  G,j  are  all  orthogonal  so  that  G  and  G*’  are  also 
orthogonal.  With  Q  =  G*’  the  given  matrix  A  will  be  expressed  as  A  =  OR. 

Based  on  the  operation  discussed  above,  we  have  a  pseudocode  for  the  Givens 
reduction  as  shown  in  Figure  2.3.3- 1. 


Note  that  in  the  update  of  a,  it  is  the  current  a^  that  is  u.sed,  not  the  new  a^  which  has 
just  been  computed  and  is  denoted  by 


28 


2.3.4  Description  of  the  Fast  Givens  Method 

As  the  name  indicates,  this  method  is  derived  from  the  Givens  method.  The 
calculations  in  the  Givens  reduction  algorithm  are  rearranged  so  that  they  can  be 
performed  with  "Householder  speed"  in  principle.  Following  [5],  the  idea  is  to  construct 
a  matrix  M  e  jR'"*'"  such  that 

MA  =  S 

is  upper  triangular  and  such  that 

MM'^  =  D  =  diag(d,....,d„).  d.  >  0 
Since  D  is  orthogonal,  it  follows  that 

A  =  M’S  =  (D'/-M)-'(D-’/-S)  =  (M'T)'/^)(D-’/2S) 
is  the  OR  factorization  of  A. 

As  we  observed  in  the  Givens  reduction  procedure,  i.e.,  GijA  =  A,,  only  tw’o  rows  of 
elements  are  altered  in  each  transformation  step.  In  this  example,  the  first  and  second 
rows  of  A  are  altered.  Therefore,  the  details  of  the  computation  can  be  explained  at  the 
2-by-2  level.  Let  x  =  (XpXj)"'  and  D  =  diag(d,,  dj)  where  d,,  d,  >  0,  and  define 

M,  =  [  )3,  1 

1  a,. 

Observe  that 

M,x  =  (3^x^  +  Xj 

X,  +  ttjXj 

and 

=  dj  +  d,)3,  +  d2ai 

dj^i  +  d2Qtj  dj  +  Qi“d2 

If  X2  ♦  0  and  a,  =  -Xj/x,  and  )3,  =  -a,d2/d„  then 

Ml  =  X2(l  +  y,) 

0 

=  d2(l  +  Y,)  0 

0  d,(l  +  Y,) 
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where  Yi  =  =  (d2/di)(xj/x2)- 


Analogously,  if  we  assume  Xj  »»  0  and  define  Mj  by 


Ma 


1  02 
^2  1 


^2  ~  *^2Ai>  ®2  “  ~i^l/^2)^2 


then 


M2X 


Xl(l  +  Y2) 
0 


and 


M2DM2'^ 


d.(l  +  Y2)  0 

0  d2(l  +  Y2) 


where  Y2  =  -<^21^2  -  ((^\/^2)i^2/h)'- 


Notice  that  the  Yj  satisfy  Y1Y2  =  f-  'Thus  we  can  always  select  Mj  in  the  above  so 
that  the  "growth  factor"  (1  +  Y;)  is  bounded  by  2.  Matrices  of  the  form 


M,  - 

.^1  1 

11 

P 

1  Ojj 

1  1 

satisfying  -1  <  afi,  <  0  are  referred  to  as  Fast  Givens  transformations.  Recalling  the  2- 
by-2  Givens  transformation  matrix, 


G,  = 


cos  0,  sin  0, 

-sin  0,  cos  0,  , 


we  notice  that  premultiplication  by  a  Fast  Givens  transformation  involves  about  half  the 
number  of  multiplies  as  premultiplication  by  a  Givens  transformation.  As  an  example, 
let  A  be  a  2  X  3  matrix.  Then  premultiplication  by  a  Fast  Givens  transformation  MjA  or 
M2A,  requires  6  multiplication  and  6  addition  operations,  while  12  multiplication  and  6 
addition  operations  are  needed  by  a  Givens  transformation.  Another  important 
observation  is  that  the  Fast  Givens  reduction  does  not  require  any  square  root 
calculations  as  the  Givens  reduction  does.  Therefore,  the  Fast  Givens  method  is 
preferable  to  the  "ordinary"  Givens  method,  and  only  the  Fast  Givens  method  was 
investigated  further. 
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The  Fast  Givens  algorithm  from  [5]  is  shown  below  in  Figure  2.3.4-1.  It  overwrites  the 
matrix  A  with  MA  where  MA  is  upper  triangular  and  MM^  =  diag(di,..,d„). 


dj  :=  1  for  i  =  l,...,m 
for  q  =  2,...,m 

for  p  =  l,2,...,min(q-l,n) 

if  a^p  =  0  then  a  =  0,  /3  =  0 
next  p 
if  *  0 
then 

o  :=  -app/a^p,  ^  :=  -ad^/dp,  y 

if  Y  <  1 
then 

[  ^pn  1  •  ”  f  ^  ^1  f  " 


=  -a/3 


app  app  .  —  ^  I  ^pp 

•••  2qn  •  .1  O.  .  aqp  ...  . 

interchange  dp  and  d^. 
dp:=  (1  +  Y)dp.  dq:=  (l  +  i 
else 

interchange  a  and  /0. 
a  :=  1/a,  /3  :=  1//3,  y  :  = 

^p  •  ”  f  ^  ^1  f  ' 

•^oD  •*•  ^an  •  iJ  I  a^jj  ... 


d  •=  (1  +  Y)d„,  d„:=  (!  +  ■ 


end 

end 


Figure  2.3.4-1.  Fast  Givens  Algorithm 

The  pattern  of  annihilation  by  the  algorithm  is  row  order  as  shown  below  for  a  6  x  5 
matrix  as  an  example  (Figure  2.3.4-2). 


X 

X 

X 

X 

X 

1st  row 

1 

X 

X 

X 

X 

2nd  row 

2 

3 

X 

X 

X 

3rd  row 

4 

5 

6 

X 

X 

4th  row 

7 

8 

9 

10 

X 

5th  row 

11 

12 

12 

14 

15 

6th  row 

Figure  2.3.4-2  Annihilation  Pattern 


The  cost  for  upper  iriangularization  of  a  matrix  by  Fast  Givens  reduction  is  obtained 
by  counting  the  arithmetic  operations  in  the  algorithm  shown  in  Figure  2.3.4-1.  Note  that 
instructions  under  the  "else"  statement  in  the  algorithm  require  more  arithmetic 
operations  than  those  under  the  "then"  statement  by  3  counts  of  the  division  operation. 
We  assume  an  equal  probability  of  execution  of  those  statements  (else  and  then).  We 
also  assume  that  the  overhead  cost  of  communication  between  registers,  initialization  of 
memory  and  transfer  is  negligible  compared  with  the  arithmetic  operation  cost. 

2.3.5  Computational  Cost  for  Fast  Givens  Reduction 

The  cost  calculations  for  the  Fast  Givens  Algorithm  of  Figure  2.3.4-1  are  shown  in 
Tables  2.3.5-1  to  2.3.5-3.  The  elements  marked  as  1,  2,  3  in  Figure  2.3.4-2  are 
annihilated  in  these  tables. 


Table  2.3.5-1.  Annihilation  of  element  marked  as  1 


d,  ;=  1  for  i  = 

l,...,m 

q  =  2 

;  Annihilation  of  2nd  row  and 

P  =  1 

;  1st  column  position  (marked  as  1) 

a  =  -a,i/a2i 

6  =  -Qfdi/di 

T  =  -  a  13 

2  X,  2  3  sign 

if  T  <  1 

1  - 

f  6  llf  a,. 

•am 

[  1  ttj  [  a-)]  a-j7 

"  a2n- 

2n  X,  2n  + 

dp  «  dq 

;  interchange 

d,  =  (1  +  T)d, 

d,  =  (1  +  T)d2 

2  X,  2  + 

else 

a  ♦♦  6 

;  interchange 

a  =  1/q,  S  = 

l/B,  T  =  1/t 

3 

f  1  aU  a,,  a, 2 

•  am 

[  ^  ij  1  a2]  322 

■■  a2n- 

2n  X,  2n  + 

d,  =  (1  +  T)d, 

d2  =  (1  +  T)d2 

2x,  2  + 

32 


Table  23.5-2.  Annihilation  of  element  marked  as  2 


q  =  3 
P  =  1 

o  =  -a„/a3j 
B  =  -o-d3/d, 

T  =  -0-6 


If  r  <  1 


6 

1 

a„ 

^12 

■■  ^in 

1 

a. 

^31 

832 

■■  a3n 

dp«d, 

d,  =  (1  +  T)di 

d3  =  (1  +  T)d3 

else 
a  ♦♦  6 
a  =  1/a 
6  =  1/6 
T  =  1/t 

1  a  a,,  a, 2  •  a,n 

.  ^  ij  I  a3i  832  •••  a3n. 


;  3rd  row 
;  1st  column 

2  X,  2  -5-,  3  sign 
1  - 

2n  X,  2n  + 

2x,  2  + 

3  ^ 

2n  X,  2n  + 


d,  =  (1  +  T)d, 

d3  =  (l  +  T)d3  2x,2  + 
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Table  2.3.5-3.  Annihilation  of  element  marked  as  3 


q  =  3 

;  3rd  row 

p  =  2 

;  2nd  column 

O  =  '^22!  ^32 

6  =  -adj/d'j 

T  =  -a-6 

2  X,  2  3  sign 

If  T  <  1 

1  - 

f  B  1]  [  a,!  323  ■■■  ^an] 

1  1  aj  1  a32  a33  •••  a3nJ 

2(n-l)  X,  2(n-l)  + 

dp-dq 

dj  =  (1  +  T)d2 
dj  =  (1  +  Odj 

2  X,  2  + 

else 
a  ♦♦  6 
a  =  1/a 
B  =  1/6 

T  =  1/t  3  4- 


1  0] 

^22  ^23  ■"  ^2n 

■  P  li 

■  ^32  ^33  ■■■  S3n 

2  (n-1)  X.  2(n-l)  + 

dj  =  (1  +  T)d2 

dj  =  (1  +  r)d3  2  X,  2  + 
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Now,  we  calculate  the  cost  of  upper  triangularization  of  a  dense  m  x  n  matrix. 

Recall  that  Fast  Givens  reduction  is  done  by  annihilating  one  element  at  a  time  as 
shown  in  Figure  2.3.4-2.  Following  the  algorithm  of  (Figure  2.3.4- 1),  and  collecting  cost 
terms  along  arithmetic  operations  as  in  the  above  sample  procedure,  we  have  a  summary 
of  cost  for  matrix  upper  triangularization.  The  column  data  in  the  table  represents  the 
total  cost  of  annihilation  of  all  the  elements  in  the  respective  row  of  the  matrix.  Tables 
2.3.5-4  and  2.3.5-6  show  the  summary  of  cost  for  the  cases  m  <  n  and  m  >  n  respectively. 
The  total  cost  for  triangularization  is  included  in  Tables  2.3.5-5  and  2.3.5-7  for  m<n  and 
m>n  respectively.  Here,  we  assume  that  the  unit  cost  for  a  division  and  multiplication 
operation  is  the  same,  and  the  unit  cost  for  subtraction  and  addition  operations  is  the 
same. 
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Table  2.3.5-4.  Summary  of  Cost  for  Upper  Triangularization  of  an  m  x  n  dense  matrix,  m<n 


VO 
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Table  2.3.5-6.  Summary  of  Cost  for  Upper  Triangularization  of  an  m  x  n  dense  matrix,  m>n 


2J.S.1  Cost  for  a  Lx>cal  Time  Update 


The  LTD  matrix  data  structure  is  shown  in  Figure  2.3.5.1-1. 


Matrix  dimensions:  {(q't-n)  x  1)| 

q  n  1 

XXX  oooooo  X 

qoxx  oooooo  X 

oox  oooooo  X 

XXX  XXXXXX  X 

oxx  oxxxxx  X 

noox  ooxxxx  x 

ooo  oooxxx  X 

ooo  OOOOXX  X 

ooo  OOOOOX  X 
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Figure  2.3.5. 1-1.  LTD  Matrix  Structure 

The  "x"  marks  in  the  figure  represent  non-zero  data,  and  the  "o"  represents  a  zero 
value.  As  we  see,  the  matrix  is  not  dense.  To  upper  triangularize  this  matrix,  we  select 
the  non-zero  elements  ("x"  marked)  below  the  main  diagonal  and  annihilate  them  one  by 
one  at  a  time.  Table  2.3.5. 1-1  shows  a  summary  of  cost  required  to  annihilate  all  the 
elements  below  the  main  diagonal  arranged  along  the  row  order. 

Combining  the  cost  terms  for  the  multiplication  and  division  together,  and  the  cost 
terms  for  the  addition  and  subtraction  together  and  simplifying,  we  have  a  total  cost  lor 
a  LTU  matrix  upper  triangularization  in  Table  2.3.5. 1-2. 
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Table  2.3.5.1-1.  Summary  of  Cost  for  Upper  Triangularization  of  an  LTU 
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2.3.52  Cost  for  a  Local  Measurement  Update 


The  LMU  matrix  data  structure  is  shown  in  Figure  2.3.5.2-1. 


Figure  23.5.2- 1.  LMU  Matrix  Structure 

The  "x"  marks  in  the  figure  represent  non-zero  data  and  the  "o"  represents  a  zero 
value.  As  we  see,  the  matrix  is  not  dense.  To  upper  triangularize  this  matrix  we  select 
the  non  zero  elements  ("x"  marked)  below  the  main  diagonal  and  annihilate  them  one  by 
one  at  a  time.  Table  2.3.5.2-1  shows  a  summary  of  cost  required  to  annihilate  all  the 
elements  below  the  main  diagonal  arranged  along  the  row  order. 

Combining  the  cost  terms  for  the  multiplication  and  division  together,  and  the  cost 
terms  for  addition  and  subtraction  together  and  simplifying,  we  have  a  total  cost  for  a 
LMU  matrix  upper  triangularization  in  Table  2.3.5.2-2. 
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Table  2.3.5.2-1.  Summary  of  Cost  for  Upper  Triangularization  of  an  LMU 


2.3.5.3  Cost  for  a  Global  Time  Update 


The  GTU  matrix  data  structure  is  shown  in  Figure  2.3.5.3-1.  There  are  two  cases  of 
system  formats  (case  A  and  case  B).  The  difference  between  the  two  is  that  case  B 
includes  the  extra  data  block  labeled  BLOCK  E  in  Figure  2.3.5.3-1. 
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Figure  2.3.5.3-1.  GTU  Matrix  Structure 
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The  "x"  marks  in  the  figure  represent  non-zero  data  and  the  "o"  a  zero  value.  As  we 
see,  the  matrix  is  not  dense.  To  upper  triangularize  this  matrix  we  select  the  non-zero 
elements  ("x"  marked)  below  the  main  diagonal  and  annihilate  them  one  by  one  at  a 
time.  Here,  we  make  the  assumptions  that  Mq  is  larger  than  q-(-n+ 1,  i  =  M-5  for  n  = 
3q,  and  i  =  M-4  for  n  =  2q.  We  first  calculate  the  cost  for  an  annihilation  of  each 
block  and  then  combine  all  of  the  block  costs  iC'^ording  to  each  case.  Table  2.3.5.3-1 
shows  the  total  cost  for  a  triangularization  of  block  A  for  both  cases  of  n  =  3q  and  n=2q. 
The  total  annihilation  cost  for  each  block,  B,  C,  D  and  E  is  included  in  Table  2.3.5.3-2. 

So  far  we  have  obtained  all  the  cost  functions  for  each  building  block.  Combining 
the  cost  functions  from  each  block  and  sorting  the  resulting  functions  according  to  the 
variable  q,  we  have  the  total  costs  for  the  system  cases  A  and  B  in  Tables  2.3.5.3-3  and 
2.3.5.3-4  respectively.  Note  that  there  are  i  identical  blocks  of  Block  B  type  in  the 
system  (Global  Time  Update),  where  i  was  taken  to  be  M-5  for  n=3q,  and  M-4  for 
n  =  2q  previously. 
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Table  2.3.5.3-1.  Cost  for  Upper  Triangularization  of  Block  A 


2.3.5.4  Cost  for  a  Global  Measurement  Update 

The  GMU  matrix  data  structure  is  shown  in  Figure  2.3.5.4-1. 
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Figure  2.3.5.4-1.  GMU  Matrix  Structure 


The  "x"  marks  in  the  figure  represent  non-zero  data  and  the  "o"  a  zero  value.  The 
total  cost  for  upper  triangularization  of  this  GMU  matrix  is  essentially  the  same  as  M 
times  the  cost  for  putting  zeros  into  all  of  the  non-zero  elements  ("x"  marked)  of  the 
standard  block.  Again,  applying  the  reduction  procedure  shown  in  Tables  2.3.5-1  to 
2.3.5-3  for  the  cost  calculation,  we  have  a  summary  of  cost  for  the  standard  block  in 
Table  2.3.5.4-1.  Table  2.3.5.4-2  shows  the  total  cost  for  an  upper  triangularization  of  the 
GMU  matrix.  The  element  at  the  location  of  row  n+1  and  column  n+1  will  not  be 
annihilated.  Therefore,  we  subtract  the  cost  for  annihilation  (filling  a  zero)  of  the 
element  from  the  cost  for  M  standard  blocks. 
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Table  2.3.5.4-1.  Cost  for  filling  zeros  into  the  standard  block 


2.3.6  Comparison  of  Costs  for  Fast  Givens  and  Householder  Methods 

So  far  we  have  obtained  total  cost  functions  for  the  four  types  of  system  matrices  as 
well  as  for  a  general  m  by  n  dense  matrbc  using  the  Fast  Givens  method  and  the 
Householder  method.  Table  2.3.6-1  shows  a  total  cost  comparison  between  two  methods 
for  a  30  by  10  dense  matrix.  The  comparison  on  the  four  types  of  system  matrices,  LTU, 
LMU,  GTU,  GMU  with  the  specific  dimensions  shown  in  the  table  is  included  in  Table 
2.3.6-2. 

From  the  above  comparison  tables,  we  see  that  the  Householder  method  is  superior 
to  the  Fast  Givens  method  for  a  general  dense  matrix.  The  Householder  method, 
however,  is  inferior  to  the  Fast  Givens  for  the  system  matrices  except  the  Global  Time 
Update  system.  This  is  due  to  the  system  matrix  structure  being  non  sparse.  In  other 
words,  the  cost  of  Householder  reduction  is  a  function  of  matrix  dimensions  only,  while 
the  cost  of  Fast  Givens  reduction  is  a  function  of  matrbc  dimensions  and  the  matrix  data 
structure.  Another  disadvantage  of  the  Householder  reduction  method  is  that  it  requires 
square  root  operations  while  Fast  Givens  does  not.  Therefore,  we  adopt  the  Fast  Givens 
method  over  the  Householder  method  for  the  application  at  hand. 
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Table  2.3.6-1.  Cost  comparison  between  Fast  Givens  and  Householder  for  an  m  x  n  dense  matrix  (m=30,  n=10) 


2.4  Matrix  Downdating  and  Updating 

The  process  of  downdating/updating  the  transformed  (triangularized)  matrices  is 
considered.  This  method  is  very  useful  because  it  drastically  reduces  the  computational 
cost  in  matrix  factorization  (or  triangularization).  Rather  than  repeating  the  whole 
computing  process  for  a  new  matrix  which  is  formed  by  either  adding  new  rows  or 
deleting  existing  rows  from  an  existing  matrix,  we  can  easily  obtain  a  new  transformed 
matrix  for  a  fraction  of  the  total  cost  by  simply  updating  the  existing  transformed  matrix. 


2.4.1  Cost  of  Downdating 


Downdating  refers  to  a  back  process  which  removes  the  contribution  made  by  the 
eliminated  rows  (data)  from  the  original  transformed  matrix  in  a  least  squares  problem. 


Let  A  be  a  square  matrix  of  order  p  and  positive  definite.  Then  A  has  a  Cholesky 
factorization  of  the  form 


A  =  R‘Tl 


where  R  is  an  upper  triangular  matrix.  Now,  A  is  modified  in  a  simple  manner  to  form 
a  new  matrix  A’  whose  Cholesky  factorization  is  needed.  Then  we  express  A’  in  the 
following  form, 


A’  =  A  -  xx'' 


where  x  is  a  vector  with  a  length  of  p  which  is  removed  from  A.  This  modification  is 
called  downdating. 

If  X  e  and  y  e  IR"  have  the  form 


X  = 

X’ 

y  = 

y’ 

.x'^ 

» 

.  n  ■ 

then  the  downdating  algorithm  computes  the  factor  corresponding  to  X’  and  y’;  i.e.,  a 
least  squares  problem  with  a  row  removed.  The  row  vector  x*'  and  a  scalar  n  are 
removed  together  from  X  and  y  respectively.  The  relationship  between  matrix  A  and  the 
matrix  X  can  be  explained  with  a  least  square  problem.  We  wish  to  determine  a  b 
vector  of  length  p  such  that 


=  I  y  -  Xbl  ^  Eq.  la 

is  minimized  (here  I  •  I  is  the  Euclidean  vector  norm).  This  problem  can  be  solved  by 
forming  the  matrix 
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(X,y)"(X.y)  = 


A  c 
ctr  6 


and  computing  its  Cholesky  factor 

Rz 

[  0  p  J. 

Then  Rb  =  z  and  is  the  residual  sum  of  squares  in  Eq.  la. 

With  this  basic  understanding,  we  review  a  subroutine  program  called  "SCHDD" 
from  UNPACK  [6].  This  subroutine  downdates  the  Cholesky  factorization  A  =  R*^  of 
a  positive  matrix  A  to  produce  the  Cholesky  factorization  R’*'R’  of  A’  which  is  A-xx"  , 
where  x  is  a  vector.  Specifically,  given  an  upper  triangular  matrix  R  of  order  p,  a  row 
vector  X  of  length  n,  a  column  vector  z  of  length  n,  and  a  scalar  r),  this  subroutine 
SCHDD  determines  an  orthogonal  matrix  U  and  a  scalar  C  such  that 


u 

R  z 

= 

R’  z’ 

[0  d 

.  X*'  rj 

where  R’  is  upper  triangular.  A  residual  norm  p  associated  with  z  is  downdated 
according  to  the  formula  p’=  y(p^  -  C^),  if  this  is  possible.  If  R  and  z  have  been 
obtained  from  the  factorization  of  a  least  squares  problem,  then  R’  and  z’  are  the  factors 
corresponding  to  the  problem  with  the  observation  {x,ri)  removed. 

Using  the  subroutine  ''SCHDD",  we  calculate  the  cost  of  downdating  one  observation 
(a  row  vector)  from  the  existing  observations.  The  cost  at  each  step  of  the  subroutine  is 
as  follows: 


1.  Solve  for  A  from  the  system,  R*'A  =  X 


R  =  (n  X  n)  = 

hi 

h2 

ri3 

•••  fin 

0 

h2 

r23 

•••  hn 

0 

0 

^33 

•••  r3n 

0 

0 

0 

...  r4„ 

0 

0 

0 

•  ' 

A  =  (1  X  n) 

=  [ 

32  33 

•"  3„] 

X  =  (1  X  n) 

=  [ 

*2  X3 

By  back  substitution,  the  vector  A  is  obtained. 


A;  Vector 

z  =  Zj 
Z2 

^3 

Z4 
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s(l)  4=  ai  =  XiAii 

s(2)  •♦=  a2  =  (x2  - 

s(3)  <=  a3  =  (X3  -  r33aj  -  r23^)/*^33 

s(4)  a4  =  (X4  -  rj4ai  -  r24a2  - 


1  + 

1  1  X,  1  - 

l  +,  2  X,  2  - 


1  +,  3  X,  3  - 


s(n)  ^  a„  =  (x„  -  r,„ai  -  r2„a2  -  r3„a3  - ...  -  v,4,a„.i)/r„„  1  +,  (n-1)  x,  (n-1)  - 

Subtotal  Cost:  n  +,  n(n-l)/2  x,  n(n-l)/2 

2.  Norm  of  A  (or  S):  n  x,  n-1  +,  1  7 

Norm  =  +  A2  +  ...  +  Al^)  n  x,  (n-1)  +,17 

if  Norm  >  1,  then  quit. 

3.  a  =  7(1  -  Norm^)  1  x  ,  1  -,  1  7 

4.  Determine  the  plane  rotations  (transformations). 


For  ii  =  1  to  n 
i  =  n-ii+ 1 
scale  =  o  +  |s(i)| 
pa  =  a/scale 
pb  =  s(i)/scale 
Norm  =  7(pa^  +  pb^) 
c(i)  =  pa/Norm 
s(i)  =  pb/Norm 
a  =  scale -Norm 
END 


n  +  ,n- 
n  +  ,  n  abs 
n  + 
n  + 

2n  X,  n  +,  n  7 
n  + 
n  + 
n  X 


5.  Apply  the  transformations  to  R.  (To  get  R’,  where  R’  is  a  downdated  R.) 

For  j  =  1  to  n 
XX  =  o 

for  ii  =  1  to  j 

i=j-ii+l  n(n+l)/2  -,  2(n+l)/2  + 

T  =  c(i)'xx  +  s(i)R(ij)  n(n+l)  x,n(n+l)/2  + 

R(id)  =  c(i)R(i,j)  -  s(i)*xx  n(n+l)  x,n(n+l)/2  - 

XX  =  T 
END 
END 


52 


6.  Apply  the  transformations  to  z  .  (To  get  z’,  where  z’  is  a  downdated  z.) 


For  j  =  1  to  nz  (nz  is  number  of  vectors  to  be  downdated,  nz=  1  for  our  purpose) 
zeta  =  y(j) 

For  i  =  1  to  n 

z(id)  =  Iz(ij)  -  s(i)*zeta)/c(i)  n  n  x,  n  + 

zeta  =  c(i)  x  zeta  -  s(i)z(ij)  n  2n  x 

END 
END 


I 

I 

I 

I 


So  far,  we  have  collected  step  by  step  the  costs  above.  Combining  the  cost  terms  along 
the  arithmetic  operations,  we  have  a  cost  for  a  downdating  with  one  observation  in  Table 
2.4.1-1. 


Table  2.4. 1-1  Cost  of  downdating  with  one  observation  removed 


Operation  • 

Cost 

- 1 

x  &  + 

n(5n+29)/2  +  1 

+  &  • 

n(5n+17)/2 

J 

n  +  2 

sign  (or  abs) 

n 

where  n  is  the  order  of  the  upper  triangular  matrix  R. 

The  Global  Measurement  Update  matrix  consists  of  (M+ 1)  standard  matrices.  The 
dimension  of  the  standard  matrix  is  [n  x  (n+ 1)].  The  cost  of  downdating  one  standard 
block  is  now  obtained.  Since  the  standard  block  represents  n  observations  (rows),  the 
downdating  cost  with  this  standard  block  would  be  n  times  the  cost  obtained  above  for 
downdating  one  observation.  Table  2.4. 1-2  shows  the  cost  of  downdating  with  one 
standard  block  removed. 


Table  2.4. 1-2  Total  Cost  of  downdating  with  a  standard  block  removed 


1  Operation 

Cost 

1  X  &  -5- 

(5/2)n’  +  (29/2)n2  +  n 

+  &  - 

(5/2)n^  +  (17/2)n2  | 

1 

n^  +  2n  1 

1  sign 

n^  1 
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2.4  J  Cost  of  Updating 

The  cost  of  updating,  with  one  observation  (a  row  vector)  added,  is  obtained. 
Assume  an  observation  vector  whose  length  is  n+ 1,  Then  the  cost  of  updating  with  this 
vector  will  be  the  same  as  the  cost  of  total  annihilation  of  this  observation  vector. 

This  cost  can  be  easily  obtained  using  the  procedure  in  section  2.3.5. 

The  cost  of  updating  the  GMU  system  with  a  standard  block  is  the  same  as  the  cost 
required  to  fill  zeros  into  the  block,  and  this  cost  was  already  shown  in  Table  23.5.4-1. 
Simplifying  the  table,  we  have  a  total  cost  for  an  updating  with  a  standard  block  added 
as  in  Table  2.4.2-1.  Note  that  the  standard  block  is  [n  x  (n-i- 1)]  in  dimension. 


Table  2.4.2-1  Cost  of  Updating  with  a  Standard  Block 


Operation 

Cost 

X  &  -r 

(l/3)n^  +  (23/4)0^  +  (179/12)n 

+  &  - 

(1/3)0^  +  (7/2)n2  +  (49/6)n 

sign 

(3/2)n(n+3) 

2.4.3  Cost  of  Downdating  and  Updating 

The  total  cost  of  downdating  and  updating  with  a  standard  block  is  the  sum  of  the  two 
costs  obtained  in  sections  2.4.1  and  2.4.2.  Table  2.4.3-1  shows  the  overall  cost  of 
downdating  and  updating  with  a  standard  block  (n  observations  modified). 

Table  2.4.3- 1  Cost  of  Downdating  and  Updating  with  a  standard  block 

Operation 

Count 

x& 

(17/6)n3  +  (81/4)n^  +(191/12)n 

+  &  - 

(17/6)n’  +  12n^  +  (49/6)n 

sign 

(5/2)n^  +  (9/2)n 

J 

n^  +  2n 
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2.5  Computational  Cost  for  Matrix  Upper  TYiangularization  with  Parallel  Processing 

Consider  a  dense  matrix  A  c  Let  R(ij,k)  denote  a  plane  rotation  in  plane  (ij) 
which  annihilates  the  element  ay^  by  the  Fast  Givens  method,  where  i  #  j,  1  <  i,  j  £  r  and 
1  <  k  <  c.  R(i,j,k)  combines  row  i  and  row  j  such  that  a’j^  =  0  in  the  resultant  matrix 
Al. 


The  cost  of  applying  R(iJ,k)  via  the  Fast  Givens  method  was  already  described  in  section 
2.3.5  although  it  was  not  explicitly  expressed  in  terms  of  parameters  i,  j,  and  k.  The  cost 
for  R(ij,k)  is  expressed  in  Table  2.5-1.  It  is  noted  that  the  cost  is  independent  of 
parameters  i  and  j  which  stand  for  row  numbers  of  vectors  (row  vectors)  involved  in  the 
operation  (plane  rotation).  The  cost  rather  depends  on  the  column  location  of  the 
element  annihilated  by  this  operation  and  the  column  dimension,  c,  of  the  matrix  (or 
vectors). 


Table  2.5-1  Cost  for  R(i,j,k) 


Operation  Cost 

X  &  -5-  2(c  -  k)  +  19/2 

+  &  -  2(c  -  k)  +  5 

sign  3 


2.5.1  Parallel  Scheme  for  Application  of  R(iJ,k) 

There  are  a  few  schemes  for  applying  R(iJ,k)  in  parallel  for  matrix  factorization 
(triangularization).  Two  schemes  from  [7]  were  reviewed. 

One  scheme,  known  as  "Sameh  and  Kuck’s",  is  systematic  and  assumes  that  j  =  i-1. 
Therefore,  R(ij,k)  =  R(i,i-l,k).  In  other  words,  the  scheme  always  picks  rows  i  and  i-1 
as  a  pair  to  annihilate  (by  a  plane  rotation)  the  element  at  the  location  of  the  i*^  row  and 
k"*  column.  Based  on  this  rule,  we  can  construct  an  annihilation  pattern  assuming  that 
there  are  enough  processors  available  for  simultaneous  plane  rotations.  Figure  2.5. 1-1 
shows  the  parallel  annihilation  scheme  by  Sameh  and  Kuck  on  a  dense  r  x  c  matrix, 
where  r  =  10  and  c=6.  The  "x"  marks  represent  elements  above  the  main  diagonal.  The 
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integer  numbers  indicate  the  step  at  which  zeros  are  created  (step  of  simultaneous 
annihilations).  For  instance,  at  step  1  we  only  perform  R(10,9,l)  to  aimihilate  the 
element  at  (10,1).  At  step  9  however,  we  have  as  many  as  five  simultaneous  independent 
plane  rotations,  namely  R(2,l,l),  R(4,3,2),  R(6,5,3),  R(8,7,4),  and  R(10,9,5). 


XXXXX  XXX 

9xxxx  XXX 
8  10  XXX  XXX 
79  11  XX  XXX 
68  10  12  X  XX  X 
5  7  9  11  13  X  X  X 

4  6  8  10  12  14  X  X 

3  5  7  9  11  13  15  X 

2  4  6  8  10  12  14  16 

1  3  5  7  9  11  13  15 


Figure  2.5. 1-1.  Sameh  and  Kuck’s  Parallel  Scheme 

It  is  easily  seen  that  at  step  t  we  perform  rotations  R(i,i-l,k)  such  that  r+2k  =  i+t+1, 
l<i<r,  where  r  stands  for  the  row  dimension  of  the  matrix  A  (which  is  10  in  this 
example).  Clearly,  the  total  number  of  steps  is  r+c-2  if  r>c  and  2c-3  otherwise,  where  c 
stands  for  the  column  dimension  of  A  (which  is  6  in  this  case). 

Another  scheme,  known  as  "Greedy",  is  shown  in  Figure  2.5.1-2. 


XXXXX  XXX 

4xxxx  XXX 
36xxx  XXX 
258xx  XXX 
247  10  X  XXX 
1469  11  XXX 
1  3  5  8  10  12  X  X 

1  3  5  7  9  11  13  X 

1  2  4  6  8  10  12  14 

1  2  3  5  7  9  11  13 


Figure  2.5. 1-2.  Greedy  Parallel  Scheme 

This  scheme  performs  at  each  step  as  many  rotations  as  possible,  annihilating  the 
elements  in  each  column  from  bottom  to  top  and  in  each  row  from  left  to  right.  For 
instance,  at  step  1  starting  from  the  first  column  and  the  bottom  (last)  row,  we  can  pick  a 
maximum  of  five  pairs  of  row  vectors  from  this  matrix  for  five  simultaneous  independent 
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annihilations.  This  yields  five  zeros  at  the  locations  (10,1),  (9,1),  (8,1),  (7,1),  and  (6,1)  as 
marked  with  an  integer  "1"  in  the  figure.  At  step  2  we  can  pick  a  maximum  of  two  pairs 
of  row  vectors  from  the  available  top  five  rows  for  the  annihilation  of  two  elements  at 
(4,1)  and  (5,1)  while  picking  a  maximum  of  two  pairs  of  row  vectors  from  the  five 
available  bottom  rows  for  the  annihilation  of  elements  at  (9,2)  and  (10,2).  There  are 
still  two  elements  not  annihilated  in  the  first  column  at  (3,1)  and  (2,1).  At  step  3  we  can 
annihilate  the  element  at  (3,1)  since  we  can  pick  a  pair  of  row  vectors  from  the  available 
top  three  vectors.  At  the  same  time  step,  we  can  also  pick  two  pairs  of  row  vectors  from 
the  five  row  vectors  available  between  the  4“'  row  and  8“*  row  to  annihilate  the  elements 
at  (7,2)  and  (8,2).  Again,  at  this  same  time  step  3,  we  are  also  able  to  pick  another  pair 
of  row  vectors  to  annihilate  the  element  at  (10,1).  Expanding  this  idea,  we  can  draw  a 
parallel  annihilation  pattern  called  "Greedy"  in  Figure  2.5. 1-2. 

The  Greedy  method  seems  to  yield  a  more  optimum  result  than  Sameh  and  Kuck’s 
scheme  for  matrbc  triangularization.  For  instance,  this  method  takes  14  steps  while 
Sameh  and  Kuck’s  method  takes  16  steps  for  completion  of  triangularization  with  the 
above  10  by  8  matrix.  This  method,  however  is  not  as  systematic  as  Sameh  and  Kuck’s. 
As  we  see,  the  pairing  of  row  vectors  for  a  plane  rotation  requires  more  complicated 
control  than  Sameh  and  Kuck’s. 

We  apply  these  schemes  on  our  system  matrices  assuming  a  full  synchronization  of 
the  processors  at  the  end  of  each  step.  Here  a  step  means  a  set  of  independent  rotations 
processed  simultaneously  by  the  processors.  Let  R(ii,jj,ki),  R(i2,j2>k2)»  ••• »  R(vjp»*S))» 
p  <  r/2,  be  the  rotations  performed  at  the  nth  step.  Tlie  cost  (time)  needed  to  achieve 
this  step  will  be  the  cost  (time)  for  R(i,j,k„i„),  where  k„,j„  =  min{kl,k2,...,kp}. 


2.5.2  Cost  of  a  Local  Time  Update 

The  LTU  matrix  structure  is  redrawn  in  Figure  2.5.2- 1.  The  mark  "x"  denotes  a  non¬ 
zero  element  while  the  mark  "o"  denotes  a  zero  element. 


Figure  2.5.2-1.  LTII  Matrix  Structure 


There  are  q(q+ 1)/2  non-zero  elements  under  the  main  diagonal  which  are  to  be 
annihilated.  Applying  Sameh  and  Kuck’s  scheme,  we  have  the  following  annihilation 
pattern  drawn  for  the  q  =  3  case  in  Figure  2.5.2-2.  The  Greedy  method  also  yields  the 
same  result  for  the  q  =  3  case.  For  a  larger  q,  two  methods  are  expected  to  produce 
slightly  different  results  but  neither  method  seems  to  yield  a  reasonable  optimum 
solution  for  a  parallel  process.  This  is  due  to  the  unique  data  structure  of  the  LTU. 
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Figure  2.5 .2-2.  Sameh  and  Kuck’s  Scheme  for  an  LTU  (q=3  case) 


One  possible  optimum  solution  (pattern)  is  presented  in  Table  2.5.2-3.  This  is  based  on 
the  realization  that  we  can  pair  vectors  in  the  following  fashion,  row  1  and  row  q+l,  row 
2  and  row  q  +  2,  ...  ,  and  row  q  and  row  2q,  simultaneously  at  every  time  step. 
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Figure  2.5.2-3.  Optimum  Annihilation  Pattern  (q=3  case) 


Based  on  this  optimum  pattern,  the  cost  for  an  upper  triangularization  in  this  parallel 
process  is  evaluated.  The  step  by  step  cost  is  listed  as  follows: 

Step  Cost  in  terms  of  R(i,j,k) 

1  R(q+ 1,1,1) 

2  R(q+ 1,2,2) 

q  R(q+l,q,q) 
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The  cost  for  R(i,j,k)  was  already  obtained  in  Table  2.5-1.  Converting  the  cost  in  R(...) 
terms  to  an  actual  cost  using  Table  2.5-1,  and  adding  the  costs  along  arithmetic 
operations  we  have  a  total  cost  in  Table  2.5.2- 1. 


Table  2.5.2- 1.  Total  Cost  for  a  Local  Time  Update  in  a  Parallel  Process 


Oper.  Cost 

X  &  +  [2(c-l)  +  19/2)  +  [2(c-2)  +  19/21  +  "•  ♦  I2(chi)+(19/2)]  -  q[2c  +(17/2))  -q’ 

+  &  -  |2(c-l)  +  5]  +  (2(c-2)+51  +  +  (2(c-q)+51  -  q(2c+4)  -  q^ 

sign  3q 

Where  c  -  q  +  n-t-l,  q  and  n  are  dimension  parameters.  See  Figure  23.2-1. 


2.5.3  Cost  of  a  Local  Measurement  Update 


The  LMU  matrix  structure  is  redrawn  in  Figure  2.5.3- 1.  The  mark  "x"  denotes  a  non¬ 
zero  element  while  the  mark  "o"  denotes  a  zero  element. 


Figure  2.5.3- 1.  LMU  Matrix  Structure 

We  apply  the  Greedy  scheme  first,  but  it  is  too  complicated  to  derive  a  general  cost 
function  in  terms  of  the  parameters,  m  and  n  in  this  parallel  scheme.  Instead,  we 
consider  two  typical  system  cases  of  m  equal  to  2  and  3. 


For  m  =  3,  we  have  a  parallel  annihilation  pattern  in  Figure  2.5.3-2.  The  IQgure  is 
drawn  for  the  case  n  =  6  as  an  example. 
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The  step-by-step  cost  for  an  upper  triangularization  of  an  LMU  employing  the  Greedy 
scheme  is  as  follows: 

Step  Cost  in  terms  of  R  (i,j,k) 

1  R(n+m,n+m-2,l) 

2  R(n  +  m-2,l,l) 

3  R(n+m-l,n+m-2,2) 

4  R(n+m-2,2,2) 


2n-l  R(n  +  m-l,n  +  m-2,n) 

2n  R(n+m-2,n,n) 

2n+l  R(n+m-l,n+m-2,n+ 1) 

Using  Table  2.5-1  for  the  cost  of  R(i,j,k)  and  adding  all  the  terms  above,  we  have  a  total 
cost  for  an  LMU  (m=3)  in  Table  2.5.3-1.  The  parameter  "c"  in  Table  2.5-1  is  equivalent 
to  n  + 1  in  the  LMU  system. 


Table  2.5.3-1  Total  Cost  of  an  LMU  using  Greedy  Scheme  (m=3) 


Operation 

X  &  -i- 
+  &  - 
sign 


Cost 

2n^  +  21n  +  19/2 
2n^  +  12n  +  5 
6n  +  3 
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For  m  =  2,  we  have  a  parallel  annihilation  pattern  in  Figure  2.5.3-3.  The  figure  is 
drawn  for  n  =  6  as  an  example. 


Figure  2,5.3-3.  Greedy  Scheme  for  an  LMU  (m=2,  n=6  case) 


As  we  see  in  figure  2.5,3-3,  the  Greedy  scheme  does  not  yield  a  parallel  process. 
Therefore,  the  total  cost  for  this  LMU  system  with  m=2  will  be  the  same  as  the  cost 
obtained  for  a  sequential  process  in  Table  2.3.5.2-2  of  section  2.3.5.2.  Note  that  the  total 
cost  for  an  LMU  with  m=3  using  Greedy’s  parallel  scheme  is  the  same  as  this  total  cost. 

Figure  2.5 .3-4  shows  the  parallel  annihilation  pattern  which  results  from  using  Sameh 
&  Kuck’s  scheme  on  the  LMU  system.  Only  the  area  annihilated  by  this  process  is 
redrawn  with  an  expansion.  Here  we  must  realize  that  Sameh  and  Kuck’s  rule,  R(i,j,k) 

=  R(i,i-l,k)  cannot  be  applied  for  annihilation  of  elements  from  (n  +  1,1)  to  (n+l,n-l)  at 
row  n  + 1  because  all  elements  of  n"*  row  below  the  main  diagonal  are  zero.  Therefore, 
the  rule  for  pairing  of  vectors  is  modified  for  those  elements;  for  example,  R(n+ 1,1,1) 
rather  than  R(n+  l,n,l)  for  annihilation  of  the  element  at  (n+ 1,1). 
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Figure  2.5.3-4.  Sameh  and  Kuck’s  scheme  for  an  LMU,  where  an  integer  number 
indicates  the  annihilation  step  and  the  mark  x  stands  for  an  element  which  may  not  be 
annihilated. 
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Again  collecting  the  cost  term  step  by  step  in  the  form  of  R(i,j,k),  we  have  the  following: 


Step 

Cost  in  terms  of  R(i,j,k) 

1 

R(n  +  m,n+m-l,l) 

2 

R(n+ m-l,n+  m-2,1) 

3 

R(n + m-2,n + m-3, 1 ) 

m 

R(n+ 1,1,1) 

m+ 1 

R(n+2,n+l,2) 

m+2 

R(n+ 1,2,2) 

m+3 

R(n+2,n+l,3) 

m+4 

R(n+ 1,3,3) 

m-l  +  2n 

R(n  +  2,n+l,n+l) 

Referring  to  Table  2.5-1  for  a  cost  conversion,  and  adding  the  cost  for  all  steps  above, 
we  have  the  total  cost  in  Table  2.5.3-2.  Note  that  the  parameter  c  in  Table  2.5-1  is 
equivalent  to  n+1  for  the  LMU  system. 

Table  2.5.3-2  Total  Cost  of  an  LMU  using  Sameh  and  Kuck’s  Scheme 

Operation 

Cost 

X  &  - 

2n(m  +  n+17/2)  +  (19/2)(m-l) 

-»■  &  - 

2n(m+n+4)  +  5(m-l) 

sign 

3(m-l+2n) 

Where  m  and 

n  are  dimension  parameters.  Refer  to  Figure  2.5.3-1. 

2.5.4  Cost  of  a  Global  Hme  Update 


For  the  purpose  of  cost  evaluation,  the  system  matrix  is  rearranged  and  shown  in 
Figure  2.5.4- 1. 
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Figure  2.5.4- 1.  GTU  Matrix  Structure  (Rearranged) 


Case  A  System: 

The  system  matrix  for  case  A  is  of  dimension  {[q  +  n+(M-l)q]  x  [q-Kn+1]}. 

Although  the  Greedy  scheme  will  produce  a  more  efficient  result  than  Sameh  and  Kuck’s 
in  annihilation,  it  is  too  complicated  to  derive  a  general  cost  function  for  the  Greedy 
scheme.  Nonetheless,  we  attempt  to  calculate  a  cost  for  Sameh  &  Kuck’s  scheme  as 
follows.  Revisiting  section  2.5.1  which  describes  Sameh  and  Kuck’s  parallel  scheme,  we 
draw  a  10  by  8  rectangular  matrix  which  shows  the  parallel  scheme  with  key  locations 
marked  with  a  #  sign.  The  key  location  gives  the  maximum  cost  of  annihilation  among 
the  elements  involved  in  a  given  annihilation  step.  This  is  due  to  the  nature  of  the  cost 
function  for  R(i,j,k)  and  the  synchronized  operation  of  processors.  Again  it  is  restated 
that  the  cost  for  an  R(i,j,k)  is  a  function  of  the  k  parameter  only;  neither  i  nor  j  affects 
the  cost. 
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Figure  2.5A-2.  Sameh  and  Kuck’s  Parallel  Scheme  with  Key  Locations 


Therefore,  the  overall  cost  for  an  upper  triangularization  of  the  matrix  under  this 
parallel  scheme  is  the  sum  of  the  costs  for  annihilation  of  the  elements  marked  with  a  # 
sign  behind  the  step  number  in  the  figure. 

Total  Cost  =  Cost[  R(10,9,l)  +  R(9,8,l)  +  R(8,7,l)  +  R(7,6,l)  +  R(6,5,l) 

+  R(5,4,l)  +  R(4,3,l)  +  R(3,2,l)  +  R(2,l,l)  +  R(3,2,2)  +  R(4,3,3) 

+  R(5,4,4)  +  R(6,5,5)  +  R(7,6,6)  +  R(8,7,7)  +  R(9,8,8)  ] 

=  Cost[  9-R(iO,9,l)  +  R(3,2,2)  +  R(4,3,3)  +  R(5,4,4)  +  R(6,5,5) 

+  R(7,6,6)  +  R(8,7,7)  +  R(9,8,8)  ] 

Applying  this  idea  to  the  system  matrix  (Case  A),  the  key  locations  are  marked  as  shown 
in  Figure  2.5.4-3.  The  overall  cost  of  annihilation  then  is  obtained  by  counting  the 
marked  (#)  locations  along  the  column  number.  Although  the  marked  key  locations  are 
obtained  fairly  easily  for  the  rectangular  matrix  shown  in  Figure  2.5.4-2,  it  is  not  obvious 
how  many  locations  should  be  marked  on  each  q  x  (q  +  n+ 1)  block  for  the  general  case 
of  q  and  n.  The  right  most  or  left  most  column  location  will  be  obtained  by  solving 
linear  equations.  From  the  above  Figure  2.5.4-2,  it  is  noted  that  all  the  locations 
annihilated  in  a  given  step  lie  on  a  straight  line  of  slope  -2.  This  is  an  important  fact  of 
Sameh  and  Kuck’s  scheme  and  will  be  applied  in  finding  key  locations  for  the  system. 
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Figure  2.5.4-3.  GTU  Matrix  (Case  A)  with  Marked  Key  Locations 


Finding  the  key  locations  in  the  [n  x  (q+n  + 1)]  block  is  straightforward.  However,  the 
key  locations  in  the  2"**  [q  x  (q+n+ 1)]  block  are  not  as  obvious,  especially  for  those  on 
rows  q+n  + 1  and  q+n+2.  We  solve  for  those  locations  by  solving  linear  equations 
based  on  the  fact  that  the  slope  of  the  linear  equation  should  be  -2  as  mentioned  above 
The  area  of  the  [n  x  (q+n  + 1)]  block  and  2"^  [q  x  (q+n+ 1)]  block  is  redrawn  in  Figure 
2.5.4-4  to  indicate  critical  locations  on  the  top  two  rows  of  the  2"^  [q  x  (q+n+ 1)]  block. 


Figure  2.5.4-4.  Critical  Locations  in  2"*’  [q  x  (q  +  n+1)]  Block 
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Critical  locations  are  labeled  with  the  letters  Rl,  R2,  LI,  and  L2  in  the  figure. 

We  solve  for  the  exact  column  locations  of  the  spots  labeled  LI  and  L2. 

Imagine  a  coordinate  system  (c  for  a  horizontal  axis  and  r  for  a  vertical  axis)  and 
superimpose  the  coordinate  system  onto  the  above  figure,  such  that  the  far  bottom  left 
element  in  the  figure  coincides  with  the  coordinates  (1,1)  in  the  coordinate  system.  Then 
we  can  set  up  the  following  linear  equations  for  LI  and  L2. 

For  LI,  we  have  r  =  -2[c-(q+l)]  +  (n+q-1)  Eq.  1 

r  =  q  Eq.  2 

r  =  q  -  1  Eq.  3 

Solving  equations  1  and  2  simultaneously  for  the  unknown  c, 

q  =  -2Ic-(q+ni  +  (n+q-1) 

then  c  is  found  to  be  c  =  (2q  +  n+l)/2. 

Since  c  must  be  an  integer,  we  round  up  the  resultant  real  value  to  its  closest  integer. 
Therefore,  c  =  RU[(2q+n+l)/2], 

where  RU  stands  for  a  round  up  of  argument.  This  c  value  represents  the  column 
location  of  LI. 

For  L2,  we  solve  equations  1  and  3  simultaneously.  Then  we  have, 

q-1  =  2[c-(q+l)]  +  (n  +  q-1) 
and  c  is  found  to  be  c  =  (2q+n+2)/2. 

Since  c  must  be  an  integer,  we  round  up  the  resultant  real  value  to  its  closest  integer. 
Therefore,  c  =  RU[(2q  +  n+2)/2] 

This  value  of  c  represents  the  column  location  of  L2. 

Now  we  solve  for  the  column  locations  of  Rl  and  R2. 

For  Rl,  r  =  *2(c-l)  +  2q  Eq.  4 

Solving  equations  2  and  4  simultaneously  for  the  unknown  c,  we  have 
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q  -2(c-l)  +  2q, 

then  c  is  found  to  be  c  =  (q+2)/2. 

Since  c  must  be  an  integer,  we  round  down  the  resultant  real  value  to  its  closest  integer. 
Therefore,  c  =  RD[(q+2)/2], 

where  RD  stands  for  a  round  down  of  argument.  This  c  value  represents  the  column 
location  of  Rl. 

For  R2,  we  solve  equations  4  and  3  simultaneously.  Then  we  have, 

q-1  =  -2(c-l)  +  2q 

and  c  is  found  to  be  c  =  (q+3)/2. 

Since  c  must  be  an  integer,  we  round  down  the  resulting  real  value  to  its  closest  integer. 
Therefore,  c  =  RD[(q+3)/2] 

This  c  value  represents  the  column  location  of  R2. 

Now,  we  need  to  find  the  critical  locations  in  the  3"^  [q  x  (q+n  + 1)]  block.  The  area  of 
the  2"*’  and  3"^  [q  x  (q  +  n+ 1)]  blocks  are  redrawn  in  Figure  2.5.4-5  to  indicate  the  critical 
locations  in  the  3”*  block. 


q 

n 

1 

X# 

X# 

X 

X 

X 

X 

X 

X# 

X# 

X# 

X# 

X# 

X 

q  o 

X# 

X# 

X 

X 

X 

X 

X 

X# 

X# 

X# 

X# 

X# 

2“*  |q  X  (q  +  n  +  1)]  Block 

o 

0 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X# 

R3 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

q  o 

R4 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

3"*  [q  X  (q  +  n  +  1)]  Block 

o 

0 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

Figure  2.5.4-5.  Critical  Locations  in  3"*  [q  x  (q+n+  1)J  Block 


The  column  locations  of  the  critical  spots  labeled  with  R3  and  R4  are  obtained  in  a 
similar  way.  Again  the  same  coordinate  system  is  superimposed  on  Figure  2.5.4-4  such 
that  the  far  bottom  left  element  of  the  figure  coincides  with  the  coordinates  (1,1)  in  the 
coordinate  system.  Then  we  can  set  up  equations  for  the  column  locations  of  the  critical 
spots  R3  and  R4. 
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For  R3,  we  have 


r  =  -2(c-l)  +  (2q-l)  Eq.  5 


Solving  these  two  equations  simultaneously  for  the  unknown  c,  we  find  c  =  (q+l)/2. 

Since  c  must  be  an  integer,  we  round  down  the  resultant  real  value  to  its  closest  integer. 
Then  we  have, 

c  =  RD[(q+l)/2]. 

where  RD  stands  for  a  round  down  of  argument.  This  value  of  c  represents  the  column 
location  of  R3. 

For  R4,  we  solve  equations  5  and  3  simultaneously.  Then  we  have, 

q-1  =  -2(c-l)  +  (2q-l) 

and  c  is  found  to  be 

c  =  (q+2)/2. 

Since  c  must  be  an  integer,  we  round  down  the  resulting  real  value  to  its  closest  integer. 
Therefore,  c  =  RD[(q+2)/2] 

This  value  of  c  represents  the  column  location  of  R4.  Note  that  the  column  locations  we 
obtained  for  R3  and  R4  are  also  applicable  to  the  4"*  through  M"*  [q  x  (n+q+1)]  block. 

The  total  cost  for  the  GTU  system  (Case  A),  is  the  sum  of  each  block  cost  as  follows; 

1)  Cost  of  (n  X  (q+n+ 1)]  Block 

Counting  the  key  marked  locations  and  adding  the  individual  cost  required  to  annihilate 
the  key  located  elements,  we  have  the  following  cost  equation. 

Cost  =  (n-q)R(ij,l)  +  2R(i,j,2)  +  2R(ij,3)  +  ....  +  2R(ij,q) 

where  i  and  j  are  arbitrary  and  do  not  affect  the  cost. 

Converting  this  equation  using  Table  2.5-1,  we  present  the  total  cost  in  Table  2.5.4-1. 

The  parameter  in  Table  2.5-1  is  equivalent  to  q+n  + 1  for  this  block. 
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Table  2.5.4-1.  Cost  of  [n  x  (q+n+ 1)]  Block 


Operation 

Count 

X  &  -5- 

+  (4q+ll/2)n  +  (15/2)q  -  19 

+  &  - 

2n^  +  (4q+l)n  +  3q  -  10 

sign 

3n  +  3q  -  6 

2)  Cost  of  2"^  [q  X  (q+n+ 1)]  Block 

Counting  the  key  marked  locations  and  adding  the  individual  cost  required  to 
annihilate  the  key  located  elements,  we  have  the  following  cost  equation. 

Cost  =  R(ij,l)  +  R(i,j,2)  +  ...  +  R(i,j,kl)  +  R(iJ,q  +  n)  +  R(iJ,q+n-l)  +  ...  +  R(ij,k3) 
+  R(ij,2)  +  R(i,j,3)  +  ...  +  R(id,k2)  +  R(ij,q  +  n+l)  +  R(ij,q+n)  +  ...  +  R(ij,k4) 

Where  i  and  j  are  arbitrary  and  do  not  affect  the  cost. 

kl  =  RD[(q+2)/2],  k2  =  RD[(q+2)/2], 

k3  =  RU[(2q+n+l)/2],  k4  =  RU[(2q+n+2)/2]. 

Converting  this  equation  using  Table  2.5-1,  we  present  the  total  cost  in  Table  2.5.4-2. 
The  parameter  in  Table  2.5-1  is  equivalent  to  q+n  + 1  for  this  block. 


Table  2.5.4-2.  Cost  of  2"'*  [n  x  (q+n  + 1)]  Block 


Operation 

Cost 

x&  -5- 

2(?  +  2(kl  +  k2-k3-k4+19/2)c  -  kl(kl-17/2)  -  k2(k2-17/2) 

+  k3(k3-21/2)  +  k4(k4  -21/2)  +  2 

+  &  - 

2c^  +  2(kl  +  k2-k3-k4  +  5)c  -  kl(kl-4)  -  k2(k2-4) 

+  k3(k3-6)  +  k4(k4-6)  +  2 

sign 

3(kl  +  k2-k3-k4  +  2c) 

where  c 

=  q+n  + 1  and  kl  to  k4  are  defined  as  above. 

3)  Cost  of  3”*  [q  X  (q+n  + 1)]  Block 

Counting  the  key  marked  locations  and  adding  the  individual  cost  required  to 
annihilate  the  key  located  elements,  we  have  the  following  cost  equation. 

Cost  =  R(ij,l)  +  R(i,j,2)  +  ...  +  R(ij,k5)  +  R(id,2)  +  R(ij,3)  +  ...  +  R(iJ,k6) 


where  i  and  j  are  arbitrary  and 

k5  =  RD[(q+l)/2],  k6  =  RD[(q+2)/2]. 

Converting  this  equation  using  Table  2.5-1,  we  present  the  total  cost  in  Table  2.5.4-3. 
The  parameter  in  Table  2.5-1  is  equivalent  to  q+n  + 1  for  this  block. 

Table  2.5.4-3.  Cost  of  3*^  [n  x  (q+n+ 1)]  Block 
Operation  Cost 

X  &  2(k5  +  k6-l)c  -  k5(k5-17/2)  -  k6(k6-17/2)  -  15/2 

+  &  -  2(K5  +  k6-l)c  -  k5(k5-4)  -  k6(k6-4)  -  3 

sign  3(k5  +  k6-l) 

where  c  =  q+n  + 1  and  k5  and  k6  are  defined  as  above. 


Note  that  the  cost  for  successive  blocks  (i.e.,  4"*  block,  5‘^  block  and  so  on)  as  well  as  the 
last  block  which  is  used  only  in  the  case  B  system,  is  the  same  as  the  cost  for  the  3"* 
block  obtained  in  Table  2.5.4-3. 


Casg  B  System; 

The  system  matrix  for  case  B  is  of  dimension  {[q+n+(M-l)q+q]  x  [q+n+1]}.  The 
extra  block  used  in  addition  to  the  case  A  system  is  the  far  bottom  block  shown  in 
Figure  2.5.4- 1.  The  cost  of  this  extra  block  is  the  same  as  the  cost  of  the  3"*  block  found 
previously.  Therefore,  no  new  calculation  is  necessary  other  than  combining  all  of  their 
individual  costs  to  obtain  a  total  cost. 
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Finally,  the  total  cost  for  an  upper  triangularization  of  GTU  in  a  parallel  process  is 
obtained  as  follows: 

1)  System  Case  A 

Total  cost  =  Cost  of  [n  x  (q+n  + 1)]  block  +  Cost  of  2"^  [q  x  (q+n  + 1)]  block 
+  (M-2)  times  the  cost  of  S"*  [q  x  (q+n  + 1)]  block 

2)  System  Case  B 

Total  Cost  =  Total  Cost  of  Case  A  +  Cost  of  3"*  [q  x  (q+n^  1)]  block 

Combining  all  costs  according  to  these  equations,  the  total  cost  is  presented  in  Table 
2.5.4-4. 


Table  2.5.4'4.  Total  Cost  of  GTU  in  Parallel  Process  Block 


Operation 

Total  Cost 

Case  A 

Case  B 

*  &  + 

2n^  +  14q  +  (ll/2)]n  +  (15/2)q  -  19 
+  2c^  +  2(kl  +  k2-k3-k4  +  19/2)c 

-  kl[kl-17/2)  -  k2(k2  -17/2) 

+  k3(k3-21/2)  +  k4(k4.21/2)  +  2 
+  2<M-2Klt5+k6-l)c -(1^-2)155(115-17/2) 

-  {M-2)k6(kfr-17/2)  -  15(M-2)/2 

2n*  +  (4q  +  ll/2)n+  (15/2)q  - 19 
+  2c^  +  2(kl  +  k2-k3-k4  +  19/2)c 

-  kl(kl-17/2)  -  k2(k2-(17/2) 

♦  k3(k3-21/2)  +  k4(k4-21/2)  +  2 

♦  2(M-lXk5+k6-l)c  -  (M-l)k5(kS-17/2) 

-  (M-l)k6(k6-17/2)  -  15(M-l)/2 

+  &  - 

2n^  +  (4q  +  l)n  +  3q  -  10 
+  2c^  +  2(kl  +  k2-k3-k4  +  5)c 

-  kl(kl-4)  -  k2(k2-4)  +  k3(k3-6) 

+  k4(k4-6)  +  2  +  2c(M-2)(k5  +  k6-l) 

-  (M-2)k5(k5-4)  -  (M-2)k6(k6-4)  -  3(M-2) 

2n’  +  (4q  +  l)n  +  3q  - 10 
+  2c^  +  2(kl  +  k2-k3-k4+5)c 

-  kl(kl-4)  -  k2(k2-4)  +  k3(k3-6) 

+  k4(k4-6)  +  2  +  2c(M-lXk5  +  k6-l) 

-  (M-l)k5(k5-4)  -  (M-l)k6(kfr4)  -  3(M-1) 

sign 

3n  ♦  3q  -  6  +  3(kl  +  k2-k3-k4  +  2c) 

+  XM-2)(k5  +  k6-l) 

3n  +  3q-6  -f  3(kl  •M(2-k3-k4 -f  2c) 

+  3(M-lXk5  +  k6-l) 

Where  c  «  q  +  n  +  1,  kl  =  RD|(q  +  2)/2;,  k2  =  RD((q+3)/2],  k3  RUl(2q  +  n  +  l)/2J, 

k4  =  RU|(2q  +  n  +  2)/21,  kS  •  RD((q  +  l)/2|,  k6  »  RDl(q  +  2)/2) 
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2.5.5  Cost  of  a  Global  Measurement  Update 
I.  First  Method 

Sameh  &  Kuck’s  scheme  is  applied  with  a  modification.  A  R(ij,k)  is  now 
implemented  with  a  R(i,i-n,k)  rather  than  with  R(i,i-l,k).  The  GMU  matrix  system  of 
dimension  [(M+  l)n  x  (n+ 1)]  is  shown  in  Figure  2.3.5.4-1,  Although  there  are  M+ 1 
identical  blocks  in  the  GMU  system,  we  take  the  first  three  blocks  to  show  the  critical 
locations  as  defined  in  section  2.5.4. 
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Figure  2.5.5- 1.  Critical  Locations  in  GMU  Sub-blocks 

Figure  2.5.5- 1  shows  the  key  and  critical  locations  for  the  2"'*  and  3”*  blocks.  The 
locations  marked  with  a  #  sign  are  key  locations  and  critical  locations  are  labeled  R5 
and  R6.  We  first  find  the  exact  column  locations  of  R5  and  R6. 

As  we  did  in  section  2.5.4,  imagine  a  coordinate  system  having  a  horizontal  axis  c  and 
vertical  axis  r  and  superimpose  it  onto  Figure  2.5.5- 1  such  that  the  element  (or  location) 
of  the  far  bottom  left  corner  coincides  with  the  coordinate  (1,1).  Then  we  can  set  up 
linear  equations  for  the  column  locations  of  R5  and  R6. 

For  R5,  we  have  following  equations. 

r  =  -2(c-l)  -t-  (2n-l) 
r  =  n 

Solving  these  two  equations  simultaneously,  we  find 

c  =  (n-H)/2. 
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Since  c  must  be  an  integer,  we  convert  it  to  its  closest  integer  value  using  a  round  down 
function. 


c  =  RD[(n+l)/2] 

The  integer  c  represents  the  column  location  of  R5. 

We  assign 

k7  =  RD[(n+l)/2J. 

For  R6,  we  have  following  equations: 

r  =  -2(c-l)  +  (2n-l) 
r  =  n  -  1 

Solving  these  two  equations  simultaneously,  we  find 

c  =  (n  +  2)/2 

Since  c  must  be  an  integer,  we  convert  it  to  its  closest  integer  value  using  a  round  down 
function. 

c  =  RD[(n+2)/2] 

The  integer  c  represents  the  column  location  of  R6. 

We  assign 

k8  =  RD((n+2)/2]. 

The  cost  for  a  total  annihilation  of  each  block  is  as  follows: 


1)  Cost  of  2"**  Block 


Cost  =  R(i,j,l)  +  R(id,2)  +  R(ij,3)  +  ...  +  R(io,n) 

+  R(ij,2)  +  R(ij,3)  +  ....  +  R(ij,n+1) 

Converting  these  terms  to  cost  terms  using  Table  2.5-1  and  simplifying  them,  we  have  the 
result  (noting  that  the  parameter  c  in  Table  2.5-1  is  equivalent  to  (n+1)  for  this  block) 

Operation  Cost 

X  & -5-  n(4c-2n+15) 

+  &  -  2n(2c-n+3) 

sign  6n 

2)  Cost  of  3"*  Block 

Cost  =  R(i,j,l)  +  R(i,j,2)  +  R(i,j,3)  +  ...  +  R(i,j,k7) 

+  R(ij,2)  +  R(i,j,3)  +  ....  +  R(i,j,k8) 

where  k7  and  k8  are  defined  as  above. 

Converting  these  terms  to  cost  terms  using  Table  2.5-1  and  simplifying,  we  have  the 
result  (Noting  that  the  parameter  c  in  Table  2.5-1  is  equivalent  to  (n+ 1)  for  this  block) 

Operation  Cost 

X  &  ^  (2c-k7+17/2)k7  +  (2c-k8+15/2)(k8-l) 

+  &  -  (2c-k7+4)k7  +  (2c-k8+3)(k8-l) 

sign  3(k7  +  k8-l) 

The  cost  for  each  successive  block  (4“',  5‘^  ... ,  M  + 1)  is  equal  to  the  cost  for  the  3*^ 
block  just  found. 

Therefore,  the  total  cost  for  an  upper  triangularization  of  a  GMU  in  a  parallel  process 
with  a  modified  Sameh  and  Kuck’s  scheme  is  as  follows. 

Total  Cost  for  a  GMU  =  Cost  of  2"“  Block  +  (M-1)  times  the  Cost  of  3"*  Block 

Using  the  cost  tables  obtained  above  for  the  2"**  and  3"*  blocks,  and  adding  cost  terms 
according  to  this  equation,  we  have  Table  2.5.5-1  for  a  total  GMU  cost. 
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Table  2.5.5-1.  Cost  of  a  GMU  using  ModiHed  Sameh  and  Kuck’s  Scheme 


Operation 

Cost 

X&'i- 

(2n  +  19)n  ♦  (M-l)k7(2n-k7+21/2) 

♦  (M-lXk8-lX2n-kS+19/2) 

*  &  - 

2n(n+5)  +  (M-l)k7(2n^t7+6)  +  (M-lXk8-lX2n-k8+5) 

sign 

6a  *  3(M-IXk7^lcS-l) 

The  parameters  in  Table  2.5.5-1  correspond  to  those  of  Figure  2.3.5.4-1  which  shows  the 
GMLJ  system  matrix  structure.  The  parameters  k7  and  k8  are  deHned  as, 

k7  =  RD[(n+l)/2],  k8  =  RD  [(n+2)/2], 

where  RD  stands  for  the  round  down  function. 


II.  Second  Method 

Now,  we  present  another  approach  for  a  parallel  process  (annihilation)  of  a  GMU 
system.  This  method  takes  advantage  of  the  uniqueness  of  the  GMU  data  structure.  A 
GMU  system  matrix  consists  of  (M  + 1)  identical  standard  blocks,  [n  x  (n+ 1)]  in 
dimension.  Blocks  are  paired  together  for  a  parallel  annihilation  process.  One  example 
of  pairing  blocks  is  shown  below. 
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and  so  on. 
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The  integer  number  in  even  numbered  blocks  indicates  the  step  of  annihilation 
between  the  pair  of  blocks.  This  process  can  be  done  simultaneously  among  all  even 
numbered  blocks.  After  half  of  the  (M+ 1)  blocks  are  processed  in  this  way,  we  again 
pair  blocks  among  unprocessed  blocks  and  another  simultaneous  annihilation  is  done. 
This  process  continues  until  all  blocks  have  been  processed. 


Following  this  idea,  we  observe  that: 


1.  The  number  of  steps  required  to  annihilate  the  n  x  n  area,  within  the 
standard  block,  is  n. 

2.  The  number  of  steps  required  for  the  last  column  (dimension  =  n)  is  not  unique. 
The  number  of  steps  required  to  annihilate  a  column  of  dimension  n  can  be 
expressed  as  below. 


Column  Dimension  (n) 
n  =  2 
2°  <  n  <  2' 

2^  <  n  <  2^ 

2^  <  n  <  2^ 

2^  <  n  <  2^ 


Number  of  Steps  Required 
1 
2 

3 

4 

5 


2P-*  <  n  <  2P 


p+1 


where  p  is  an  integer,  p  =  RU(log2n),  where  RU  stands  for  round  up  of  argument. 

3.  For  (M+ 1)  standard  blocks,  (M+ 1)/2  standard  blocks  can  be  simultaneously 
processed  in  the  first  round,  then  (M  +  l)/4  blocks  in  the  second  round,  and 
(M+ 1)/8  blocks  in  the  third  round  and  so  on,  provided  that  M  + 1  =  2**,  where  q  is 
an  integer.  For  a  general  M,  let  M’  =  M  + 1  then  we  have  the  following  table. 


Number  of  Standard  Blocks  (M’) 

M’  =  1 
2°  <  M’  <  2' 

2’  <  M’  <  2- 
2^  <  M’  <  2^ 

2’  <  M’  <  2" 


Number  of  Round  Required 
(to  exhaust  blocks) 

0 

1 

2 

3 

4 


2‘'-*  <  M’  <  2*^ 


where  q  is  an  integer,  q  =  RU(log2  M’).  "Round"  stands  for  a  unit  period  required 


76 


to  annihilate  a  standard  block  completely. 

4.  The  number  of  steps  required  to  annihilate  the  second  block  is  one  step  (the  last 
step)  less  than  the  number  required  for  the  other  standard  block.  (The  element  at 
the  top  right  corner  is  not  annihilated.) 


The  step  by  step  cost  is  listed  for  a  total  annihilation  of  a  standard  block. 
Step  Cost  in  terms  of  R(ij,k) 

1  R(3n+  l,2n+ 1,1)  ;Expressed  based  on  4th  block. 

2  R(3n+l,2n+2,l) 

3  R(3n+l,2n+3,3) 


n  R(3n+l,3n,n) 

n  + 1  R(4n,3n  +  l,n  + 1 ) 


n+ l  +  RU(log2  n)  R(3n+l,n+l,n+l) 
where  RU  Stands  for  a  round  up  function. 

Adding  the  above  step  by  step  cost  terms  and  converting  the  sum  to  an  actual  cost  using 
Table  2.5-1,  we  have  the  total  cost  for  a  standard  block  in  Table  2,5.5-2.  The  parameter 
c  in  Table  2.5-1  represents  n  + 1  in  the  standard  block. 


Table  2.5.5-2  Cost  for  a  total  annihilation  of  a  standard  block 


Operation  Count 

X  &  ^  (n  +  21/2)n  +  (19/2)[RU(log2  n)+ 1] 
+  &  -  (n  +  6)n  +  5[RU(log2  n)  +  l] 
sign  3[n  + 1  +  RU(log2  n)] 


The  total  cost  for  an  upper  triangularization  of  a  GMU  system  will  then  be  the 
number  of  round  ups  (defined  above  in  item  3)  times  the  cost  of  a  standard  block.  The 
number  of  round  ups  for  M  + 1  blocks  was  found  to  be  RU[log2  (M+ 1)].  Table  2.5.S-3 
shows  the  total  cost  of  GMU  system  using  the  second  method. 
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Table  2.5.5-3.  Cost  of  a  GMU  using  2nd  Method 


Operation 

Count  I 

X  &  + 

RU[log2  (M  +  l)]{(n  +  21/2)n  +  (19/2)[RU(log2  n)+l]}  -  19/2 

+  &  - 

RU(log2  (M+l)]{(n+6)n  +  5[RU(log2  n)+l]}  -  5 

sign 

3RU(log2  (M+l)][n+l  +  RU(log2  n)]  -  3 
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2.6  Implementation  of  Parallel  Processing 

Parallel  processing  for  speedup  of  matrix  factorization  (triangularization)  employing 
the  Fast  Givens  method  can  be  implemented  in  electronic  hardware.  As  we  see  from 
the  cost  analysis  section,  a  reduction  of  cost  (or  speedup)  can  be  obtained  in  two  ways; 
Hrst,  from  use  of  a  special  hardware  processor  (cell)  which  is  designed  for  an  optimum 
plane  rotation,  and  second,  from  use  of  multiple  processors  for  parallel  processing.  We 
attempt  to  implement  both  methods  together  for  best  processing  performance. 

2.6.1  Cell  Structure  and  Operation 

The  block  diagram  of  a  hardware  cell  is  shown  in  Figure  2.6.1-1.  This  cell  is  a 
hardware  version  of  the  Fast  Givens  algorithm  shown  in  Figure  2.3.4- 1.  The  cell  consists 
of  floating  point  multipliers,  dividers,  adders,  a  comparator,  multiplexers,  sign  changers, 
and  vector  and  shift  registers  with  sequence  control.  The  cell  also  contains  a  local 
memory  large  enough  to  hold  two  identity  tokens  (tp  and  tQ),  two  scale  factors  (dp  and 
dg),  a  column  number  (p)  which  indicates  the  column  location  where  a  zero  would  be 
produced,  and  two  rows  of  vectors  (ap  i,..,ap  n,  aQ,i,..,aQ  n).  The  plane  rotation  between 
these  two  rows  is  done  completely  in  this  cell,  producing  two  rows  of  altered  elements. 
The  element  indicated  by  the  column  number  P  of  the  second  row  (Q)  vector  becomes  a 
zero  as  a  result  of  the  plane  rotation.  The  operation  of  the  cell  is  briefly  stated  as 
follows: 

1.  The  data  stream  mentioned  above  is  entered  into  the  cell’s  memory  from  either  a  host 
computer  or  a  neighbor  cell. 

2.  The  comparator  output  signal  labeled  as  "y  >  1”  serves  to  select  the  proper  mode  of 
data  before  the  actual  matrix  multiplication  is  performed  by  the  multiplier  and  adder 
arrays  in  the  bottom  section  of  the  cell. 

3.  Two  rows  of  altered  vectors,  ap’  and  aQ’  are  generated  as  a  result  of  the  matrix 
multiplication  process. 

4.  The  comparator  output  signal,  y>1,  is  also  used  to  identify  the  row  information  of  the 
newly  generated  output  vectors  (ap’,  ag’)  and  to  control  the  sequence  of  outgoing  data 
to  a  neighbor  cell  or  a  host  computer.  For  instance,  if  the  signal,  y  >  1,  is  true  (or  1) 
then  the  contents  of  ap’  are  interchanged  with  the  contents  of  aQ’  before  being  sent 
out.  If  the  signal,  y  >  1,  is  not  true  (or  0),  then  no  action  is  required. 
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Figure  2.6.1-1.  Block  diagram  of  the  Hardware  Cell 
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2.62  Cost  of  a  Plane  Rotation  Using  a  Single  Cell 


The  cost  of  a  plane  rotation,  R(ij,k),  is  shown  below  in  Table  2.6.2- 1.  The  notation, 
R(i j,k),  represents  the  Fast  Givens  Rotation  between  the  i“*  and  j“*  rows  to  produce  a 
zero  at  the  (i,k)  location.  This  cost  is  based  on  a  sequential  execution  of  the  algorithm 
(shown  in  Table  2.3.4-1)  for  a  single  plane  rotation.  As  we  see,  the  cost  depends  on  two 
parameters,  c  and  k.  The  parameter  c  represents  the  length  of  the  i  and  j  row  vectors, 
and  k  represents  the  column  location  of  an  element  annihilated  by  the  rotation. 


Table  2.6.2- 1.  Cost  of  a  Plane  Rotation  in  Sequential  Execution 


Operation 

X  &  -5- 
+  &  - 
sign 


Cost 

2(c-k)  +  19/2 
2(c-k)  +  5 
3 


Therefore,  the  cost  varies  with  the  column  location  of  the  annihilated  element. 
Assuming  total  annihilation  of  a  vector  of  length  c,  the  average  cost  will  be, 

Operation  Cost 


X  &  ^  (c-1)  +  19/2 

+  &  -  (c-1)  +  5 

sign  3 

But  this  cost  is  drastically  reduced  (time  step  wise)  by  executing  the  same  rotation  using 
the  cell  we  propose  in  Figure  2.6.1-1.  The  new  cost  for  an  R(i,j,k)  rotation  will  be: 

Operation  Cost 

X  &  ^  5 

+  &  -  2 


Therefore,  the  new  cost  for  a  plane  rotation,  R(i,j,k),  is  always  a  constant  and  a 
minimum  (7  arithmetic  operation  cycles).  This  is  due  to  the  internal  parallel  structure  of 
the  cell.  Notice  that  this  new  cost  is  even  lower  that  the  residual  cost  required  in  a 
sequential  execution  for  a  plane  rotation.  Here,  the  term  "residual"  refers  to  the 
constant  cost  terms  in  the  cost  table  (Table  2.6.2- 1). 


2.62  Architecture  of  a  Parallel  System  of  Cells 

The  parallel  process  for  a  matrix  factorization  can  be  implemented  using  multiple 
cells  simultaneously.  Theoretically,  m/2  cells  are  required  for  the  maximum  speed  of 
factorization  on  a  matrix  of  m  by  n  in  size.  However,  it  may  not  be  economically 
feasible  when  m  is  so  large  compared  with  n,  such  as  when  m  >  2n.  A  possible 


underlying  architecture  is  a  SIMD/MIMD  machine  on  the  order  of  n  cells  which 
communicates  through  the  shared  memory  of  a  host  computer. 

We  propose  an  implementation  of  Sameh  and  Kuck’s  scheme  on  a  linear  array  of 
processing  cells.  The  order  of  annihilation  by  the  scheme  is  shown  for  a  matrix  of  10  by 
8  below.  An  integer  represents  the  order  of  time  step  for  an  annihilation  at  which 
zeroes  are  created  at  the  respective  locations.  The  time  interval  of  each  step  is  uniform 
(7  arithmetic  operations)  due  to  the  internal  structure  of  the  cell  (all  the  rotated  vector 
elements  are  processed  simultaneously  --  independent  of  column  location). 
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There  is  an  obvious  solution  which  makes  use  of  an  array  of  n  cells,  each  cell  Q 
performing  all  the  rotations  R(i,i-l,k),  k+  l<i<n.  The  notation,  R(i,i*l,k),  represents  the 
Givens  Rotation  between  rows  i  and  (i-1)  to  produce  a  zero  at  the  (i,k)  location.  This 
repartition  of  the  rotations  among  the  cells  is  represented  by  Figure  2.6.3- 1. 


Figure  2.6.3- 1.  Partition  of  Annihilations  with  n  Processing  Cells 
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The  number  of  cells  required  can  be  further  reduced.  Rather  than  using  n  cells  per 
column  as  shown  above,  it  can  be  done  with  less  cells  if  we  let  the  rows  of  the  matrix 
move  backwards  as  soon  as  all  of  the  rotations  of  the  Hrst  column  are  completed.  The 
repartition  of  the  rotations  among  the  cells  is  given  in  Figure  2.6.3-2. 


This  method  loses  its  advantage  of  less  cells  being  required  over  the  former  n-cell 
method  when  the  number  of  rows  of  the  matrix  grows.  Eventually,  this  method  requires 
n  cells  just  like  the  former  method.  The  Table  2.6.3-1  below  shows  a  comparison  among 
the  three  methods  described. 


Table  2.6.3- 1.  Comparison  of  Partition  Methods 


Matrix  Size 

Number  of  Cells  Required 

m/2  cell 

n  cell 

Less  cell 

8x8 

4 

8 

4 

10  X  8 

5 

8 

5 

16  X  8 

8 

8 

8 

20x8 

10 

8 

8 

30x8 

15 

8 

8 

40x8 

20 

8 

8 
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We  propose  to  implement  the  last  method  (less  than  n)  with  a  linear  array  of  cells  as 
depicted  in  Figure  2.6.3-3. 


Figure  2.6.3-3.  Linear  Array  of  Cells 


There  are  two  phases  in  the  operation  of  a  cell.  In  the  beginning  phase,  at  each  time 
step  t<n-l,  a  row  of  data  moves  rightward  from  the  host  computer,  and  each  cell  Q 
operates  as  indicated  in  the  following. 


out 


•  Perform  a  rotation  between  rows  a,  and  aj  defined  as 
R(a2,aj,k),  where  k  is  chosen  to  annihilate  the  leftmost  non-zero 
element  of  aj. 

•  Send  aj  to  the  right:  a^u,  ^  aj 

•  Store  a,  and  ai„  in  the  local  memory:  aj  ^  ai,  a, ajn 

In  the  last  phase,  from  step  t=n  up  to  t=2n-2,  the  flow  of  data  is  reversed  and  the  cells 
operate  in  a  similar  fashion.  During  this  phase,  cell  Cj  delivers  to  the  host  a  new  row  of 
the  resulting  matrix  A’  at  each  time  step. 
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3.  Conclusions 


Creating  accurate  tracks  of  multiple  airborne  targets  from  multiple  sensors,  in  real 
time,  can  be  a  computationally  demanding  process.  Measurements  from  each  sensor 
must  first  be  correlated  with  each  track.  Then,  after  a  correct  association  is  made,  the 
track  can  be  updated  to  derive  a  new  estimate.  A  variety  of  algorithms  for  performing 
these  processes  of  "data  association"  and  "track  updating"  have  been  described  in  the 
literature.  Our  approach  is  to  perform  hypothesis  testing  based  upon  the  traditional 
method  of  Maximum  Likelihood,  but  within  a  distributed  filtering  environment.  This 
results  in  a  large  reduction  in  the  number  of  floating  point  computations  required  to 
generate  the  complete  set  of  likelihood  function  values. 

This  final  report  describes  results  obtained  over  a  6  month  Phase  I  project.  The 
primary  mathematical  operation  performed  by  the  distributed  filter  is  matrix 
triangularization.  Thus,  this  research  focused  on  understanding  algorithms  for 
performing  this  operation,  as  well  as  their  parallelization. 

Three  methods  based  on  the  orthogonal  reduction  were  reviewed.  They  are  the 
Householder,  Givens,  and  Fast  Givens  methods.  Gaussian  elimination  seemed  to  be  an 
attractive  alternative  in  that  it  is  less  costly  than  those  based  on  orthogonal  reduction, 
but  this  method  is  not  numerically  stable  and  requires  pivoting. 

An  analysis  of  computational  cost  was  performed  for  Householder  and  Fast  Givens 
methods.  Although  the  Householder  method  is  superior  to  the  Fast  Givens  method  for  a 
generally  dense  matrbc  factorization,  the  Fast  Givens  method  well  outperforms  the 
Householder  in  triangularizing  the  Local  Time  Update,  Local  Measurement  Update,  and 
Global  Measurement  Update  matrices  of  our  distributed  filter.  On  the  other  hand, 
triangularization  of  the  filter’s  Global  Time  Update  matrix  was  more  efficiently  done 
using  the  Householder  transformation.  This  is  due  to  the  sparse  data  structure  of  the 
matrix. 

The  concept  of  downdating  and  updating  was  reviewed  along  with  algorithms  from 
LINPACK.  While  the  updating  process  is  no  different  from  a  total  annihilation  process 
of  the  newly  added  observations  (appearing  as  rows  of  data),  downdating  is  a  backwards 
process  which  removes  the  contribution  made  by  the  eliminated  observations,  from  the 
transformed  (triangularized)  matrbc.  The  cost  of  downdating  was  obtained  based  on  the 
subroutine  "SCHDD"  from  LINPACK. 

The  "Sameh  and  Kuck’s"  scheme  and  "Greedy"  scheme  were  reviewed  as  possibilities 
for  parallelization.  Both  schemes  were  modified  in  order  to  best  suit  the  block  upper- 
triangular  nature  of  the  system  matrices.  Finally,  a  hardware  processing  "cell"  which 
performs  a  plane  rotation  using  the  Fast  Givens  method,  was  designed.  A  linear  array  of 
cells  following  Sameh  and  Kuck’s  parallel  scheme  was  proposed. 
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Although  positive  results  were  obtained,  it  is  thought  that  the  funding  needed  to 
complete  the  development  of  a  prototype  processor  would  be  prohibitively  high  under 
the  auspices  of  this  Small  Business  Innovative  Research  program.  A  customized  VLSI 
design  approach  would  probably  entail  funding  at  the  level  of  10  million  dollars,  whereas 
Phase  II  SBIR  funding  is  typically  at  the  level  of  500  thousand  dollars.  Thus,  we 
recommend  that  an  off-the-shelf  multi-processor  be  purchased  (or  Government 
furnished)  in  Phase  II,  with  time  better  spent  in  developing  efficient  code  for  using  it  (to 
perform  the  triangularization  process  in  parallel  to  the  fullest  extent  possible). 
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